Check to see if a URL is a download link using WebClient in C#

I am reading from the history database, and for every URL read, I am downloading it and storing the data in a string. I want to be able to determine whether the link is a download link, i.e. ends in .exe or .zip, for example. I assume I need to read the response headers to determine this, but I don't know how to do that with WebClient. Any suggestions?
while (sqlite_datareader.Read())
{
    noIndex = false;
    string url = (string)sqlite_datareader["url"];
    try
    {
        if (url.Contains("http") && !url.Contains(".pdf") && !url.Contains(".jpg")
            && !url.Contains("https") && !isInBlackList(url))
        {
            WebClient client = new WebClient();
            client.Headers.Add("user-agent", "Only a test!");
            string htmlCode = client.DownloadString(url);
        }
    }
    catch (WebException)
    {
        // Skip URLs that fail to download
    }
}

Instead of loading the complete content behind the link, I would issue a HEAD request.
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
Quoted from http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
See these questions for C# examples:
How to check if a file exists on a server using c# and the WebClient class
How to check if System.Net.WebClient.DownloadData is downloading a binary file?
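For a WebClient-only variant: the class has no built-in way to choose the HTTP method, but a common workaround, sketched here with the made-up class name HeadOnlyWebClient, is to override GetWebRequest:
class HeadOnlyWebClient : System.Net.WebClient
{
    protected override System.Net.WebRequest GetWebRequest(Uri address)
    {
        var request = base.GetWebRequest(address);
        request.Method = "HEAD"; // headers only, no message body
        return request;
    }
}

// Usage: the body comes back empty, but ResponseHeaders is populated.
using (var client = new HeadOnlyWebClient())
{
    client.DownloadData(url); // url as in the question above
    string contentType = client.ResponseHeaders["Content-Type"];
}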

You're on the right track; you'll need to examine the ResponseHeaders after a successful request:
var someType = "application/zip";
var contentType = client.ResponseHeaders["Content-Type"];
if (contentType != null && contentType.Contains(someType))
{
    // this was a "download link"
}
The tricky part will be in determining what constitutes a download link since there are so many content types possible. For example, how would you decide whether XML data is a download link or not?
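One workable heuristic, sketched below with an illustrative (not exhaustive) whitelist, is to treat anything that isn't a "page-like" type as a download:
// Assumes the WebClient instance from above has just completed a request
// (and a using System.Collections.Generic; for HashSet).
var pageTypes = new HashSet<string> { "text/html", "application/xhtml+xml", "text/plain" };
string contentType = client.ResponseHeaders["Content-Type"] ?? "";
string mediaType = contentType.Split(';')[0].Trim(); // drop "; charset=..." suffix
bool isDownload = !pageTypes.Contains(mediaType);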

Try checking WebClient's ResponseHeaders collection to validate the response file type.

In case anyone has the same problem: I used a column in the history database places.sqlite which came in very handy!
places.sqlite contains a table called moz_historyvisits which has a column visit_type. According to [1], a visit_type of 7 marks a download. Reading this value therefore determines whether an entry is a download link without reading the response headers or even sending a HEAD request.
[1] http://www.firefoxforensics.com/research/moz_historyvisits.shtml
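For anyone who wants to filter at the query level instead, a sketch (assuming the System.Data.SQLite provider and the schema described in [1], where moz_historyvisits.place_id joins to moz_places.id):
using (var conn = new SQLiteConnection("Data Source=places.sqlite"))
{
    conn.Open();
    var cmd = new SQLiteCommand(
        "SELECT p.url FROM moz_historyvisits v " +
        "JOIN moz_places p ON p.id = v.place_id " +
        "WHERE v.visit_type = 7", // 7 = download, per [1]
        conn);
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            Console.WriteLine((string)reader["url"]); // download links only
        }
    }
}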

Related

How can I edit an HTTP request in C# using FiddlerCore

What I want to be able to do: edit HTTP requests before they are sent off to the server.
The user navigates to a webpage of their choice in their browser > they encounter a request they wish to edit > they edit the request, and the edited version is sent to the server instead of the original one.
What I have done so far: I have captured the request; now I need help finding the code to edit it. Here is my code for capturing the request so far:
Fiddler.FiddlerApplication.BeforeRequest += sess =>
{
    // Code to detect user-specified URL here
};
Is it possible for me to edit the request before it is actually sent? If it can be done using the FiddlerCore API alone then I'd be grateful, although I am willing to download more binaries if required.
Additional notes: I have tried StreamWriters, BinaryWriters, and copying the response into a memory stream, editing it, and copying it back; none of those methods worked for me. Also, with some methods my app just hangs and doesn't respond to things like pressing the X.
Maybe I'm just bad at explaining what I'm trying to achieve, since the only good answer I have received has been about responses :/
If the request reads the string "hello world" then I'd like the user to be able to change the REQUEST to say "hello there".
Such a noobish mistake I made: I thought that RequestBody was read-only! Turns out I could have simply edited the request like this:
session.RequestBody = myBytes;
Really annoyed at myself for this!
In the demo app, adding the delegate is shown as:
Fiddler.FiddlerApplication.BeforeResponse += delegate(Fiddler.Session oS) {
    // Console.WriteLine("{0}:HTTP {1} for {2}", oS.id, oS.responseCode, oS.fullUrl);

    // Uncomment the following two statements to decompress/unchunk the
    // HTTP response and subsequently modify any HTTP responses to replace
    // instances of the word "Microsoft" with "Bayden". You MUST also
    // set bBufferResponse = true inside the beforeREQUEST method above.
    //
    // oS.utilDecodeResponse(); oS.utilReplaceInResponse("Microsoft", "Bayden");
};
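Combining the two, a minimal sketch of editing a request body inside BeforeRequest might look like this (the URL filter and replacement text are made up for illustration):
Fiddler.FiddlerApplication.BeforeRequest += delegate(Fiddler.Session oS) {
    // Hypothetical filter: only touch the request the user chose to edit
    if (oS.fullUrl.Contains("example.com/target"))
    {
        oS.utilDecodeRequest(); // remove any content encoding first
        oS.RequestBody = System.Text.Encoding.UTF8.GetBytes("hello there");
    }
};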

Add request headers with WebClient C#

I have the following code, with which I download a web page into a byte array and then print it with Response.Write:
WebClient client = new WebClient();
byte[] data = client.DownloadData(requestUri);

/*********** Init response headers ********/
WebHeaderCollection responseHeaders = client.ResponseHeaders;
for (int i = 0; i < responseHeaders.Count; i++)
{
    Response.Headers.Add(responseHeaders.GetKey(i), responseHeaders[i]);
}
/***************************************************/
Besides the response headers, I need to add request headers as well. I tried to do it with the following code:
/*********** Init request headers ********/
NameValueCollection requestHeaders = Request.Headers;
foreach (string key in requestHeaders)
{
    client.Headers.Add(key, requestHeaders[key]);
}
/***************************************************/
However it does not work, and I get the following exception:
This header must be modified using the appropriate property. Parameter name: name
Could anybody help me with this? What's the correct way of adding request headers with WebClient?
Thank you.
The headers collection "protects" some of the possible headers, as described on the MSDN page here: http://msdn.microsoft.com/en-us/library/system.net.webclient.headers.aspx
That page seems to give all the answers you need, but to quote the important part:
Some common headers are considered restricted and are protected by the
system and cannot be set or changed in a WebHeaderCollection object.
Any attempt to set one of these restricted headers in the
WebHeaderCollection object associated with a WebClient object will
throw an exception later when attempting to send the WebClient
request.
Restricted headers protected by the system include, but are not
limited to the following:
Date
Host
In addition, some other headers are also restricted when using a
WebClient object. These restricted headers include, but are not
limited to the following:
Accept
Connection
Content-Length
Expect (when the value is set to "100-continue")
If-Modified-Since
Range
Transfer-Encoding
The HttpWebRequest class has properties for setting some of the above
headers. If it is important for an application to set these headers,
then the HttpWebRequest class should be used instead of the WebRequest
class.
I suspect the reason for this is that many of these headers, such as Date and Host, must be set differently for each request, so you should not be copying them. Indeed, I would personally suggest that you not copy any of them. Put in your own user agent: if the page you are fetching relies on a certain value, you want to make sure you always send a valid one rather than relying on the original user to supply it.
Essentially, work out what you actually need to do rather than finding something that works and using it without fully understanding what it does.
It looks like you're trying to set a header which must be set using one of the WebClient properties (CachePolicy, ContentLength or ContentType).
Moreover, it's not a good idea to blindly copy all the headers; you should copy just those you really need.
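As a rough sketch of that filtering approach (assuming the same Request and client objects as in the question), the static WebHeaderCollection.IsRestricted check can weed out most of the protected headers. Note that WebClient restricts a few more headers than IsRestricted reports, so this is a heuristic, not a guarantee:
foreach (string key in Request.Headers)
{
    // Skip headers the framework manages itself (Host, Connection, ...)
    if (!System.Net.WebHeaderCollection.IsRestricted(key))
    {
        client.Headers.Add(key, Request.Headers[key]);
    }
}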

System.Net.HttpWebResponse Returning System.IO.Stream.NullStream

I have a case where HttpWebResponse.GetResponseStream() returns a null stream even though examining the HttpWebResponse object reveals that its underlying m_ConnectStream is an instance of System.Net.ConnectStream and the ContentLength property matches exactly the length of the content returned from the server. I also poked around in the Watch window and found my data, but I can't remember where; I KNOW my response data is there, the runtime just won't let me at it!
The only thing different from other, successful scenarios is that the HttpWebRequest verb is "HEAD". I'm implementing a highly RESTful web service and want to use "HEAD" to request metadata for resources.
Figured it out:
I found the following .NET Fx source code (in the HttpWebResponse class):
/// <devdoc>
///    <para>Gets the stream used for reading the body of the response from the
///       server.</para>
/// </devdoc>
public override Stream GetResponseStream()
{
    if (Logging.On)
        Logging.Enter(Logging.Web, this, "GetResponseStream", "");
    CheckDisposed();
    if (!CanGetResponseStream())
    {
        // give a blank stream in the HEAD case, which = 0 bytes of data
        if (Logging.On)
            Logging.Exit(Logging.Web, this, "GetResponseStream", Stream.Null);
        return Stream.Null;
    }
    if (Logging.On)
        Logging.PrintInfo(Logging.Web, "ContentLength=" + m_ContentLength);
    if (Logging.On)
        Logging.Exit(Logging.Web, this, "GetResponseStream", m_ConnectStream);
    return m_ConnectStream;
}
As you can see it explicitly returns a null stream for "HEAD" requests. "Why would it do that?" I ask.
I found this at http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html:
9.4 HEAD
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
Wow. I had taken from Richardson and Ruby's RESTful Web Services book the idea that you might be clever and respond to a "HEAD" request with a blank XHTML form that fully describes the structure of a resource's elements, including requiredness, datatype, length, etc., using all of the XHTML(5) form field attributes. After reading the HTTP spec, however, it is clear that all 'HEAD' response data has to go in the HTTP headers.
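In practice, then, the metadata has to be read off the HttpWebResponse itself rather than from the stream. A minimal sketch (the URL is a placeholder):
var request = (HttpWebRequest)WebRequest.Create("http://example.com/resource");
request.Method = "HEAD";
using (var response = (HttpWebResponse)request.GetResponse())
{
    // GetResponseStream() would return Stream.Null here, but the
    // metadata survives in the headers:
    long length = response.ContentLength;
    string type = response.ContentType;
    string lastModified = response.Headers["Last-Modified"];
}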
Oh well, you learn something new every day ...

HttpWebRequest vs Webclient (Special scenario)

I know this question has been answered before in this thread, but I couldn't seem to find the details.
In my scenario, I am building a console application which will keep an eye on an HTML page's source for any changes. If any update/change occurs, I will perform further operations. Moreover, I'll perform a request every second, or as soon as the previous request finishes.
I can't figure out whether I should use HttpWebRequest or WebClient for downloading the page source and performing the comparison. What do you think would be the ideal solution in my case? Speed and reliability both :)
I'd go with HttpWebRequest because it's less abstracted and lets you fiddle with the HTTP parameters quite a bit. It gives you the option to not download the entire page if the server reports "file not changed", for example.
If you add some parameters to your request, like IfModifiedSince (it might be a HEAD or GET request), the server may return response code 304 - NOT MODIFIED. Refer to the description of caching in HTTP for further explanation.
The point is to make sure that you only download the full page when it's actually modified since the last time you fetched it. Most of the time it won't be changed (I suppose, can't know for sure without knowing your domain), so you only need to get a lightweight response from server which simply states "nothing changed here".
Update: code sample demonstrating the use of IfModifiedSince property:
bool IsResourceModified(string url, DateTime dateTime)
{
    try
    {
        var request = (HttpWebRequest)WebRequest.Create(new Uri(url));
        request.IfModifiedSince = dateTime;
        request.Method = "HEAD";
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return true;
        }
    }
    catch (WebException ex)
    {
        if (ex.Status != WebExceptionStatus.ProtocolError)
            throw;
        var response = (HttpWebResponse)ex.Response;
        if (response.StatusCode != HttpStatusCode.NotModified)
            throw;
        return false;
    }
}
This method should return true if the page was modified since the dateTime date and false if it wasn't. GetResponse will throw a WebException if you make a HEAD request and the server returns 304 - NOT MODIFIED (which is somewhat unfortunate). We have to make sure that it's not some other web connection problem, which is why I check the status of the web exception and the HTTP status in the response. If anything else caused the exception, we just rethrow it.
Console.WriteLine(IsResourceModified("http://example.com", new DateTime(2009, 1, 1)));
Console.WriteLine(IsResourceModified("http://example.com", DateTime.Now));
This sample code produces the output:
True
False
Note: make sure to read Jim Mischel's addition to this answer, as he gives a few good pieces of advice on this technique.
I was going to leave this as a comment to Dyppl's answer, but it became too long.
Dyppl's response is generally good advice, and the way that I would approach this problem. However, there are a few things you should keep in mind.
First, there's no reason to do a HEAD request followed by a GET if the page has been modified. You can do a GET with the If-Modified-Since header set, and the server will either return the entire page or a 304. Doing the HEAD first, followed by the GET, ends up making two requests to the server, which defeats much of the purpose of the conditional request.
Second, you should set the IfModifiedSince property to the LastModified value returned by the previous response (i.e. HttpWebResponse.LastModified), because the server's time might not be synchronized with your computer's. Also, I've found that a large percentage of sites, particularly those with generated content (like WordPress blogs), lie: they always return the current date/time in the LastModified header. As a result, there is no benefit to doing the If-Modified-Since check on those sites.
If you know that the site lies and always returns the current date/time, you can keep track of the ContentLength header that's returned from the page when you download it. Then, when you want to check to see if the page has changed, do a HEAD request and check the returned ContentLength header with the saved value. If they match, then it's unlikely that the page has changed. If they don't match, then do a GET request to update your copy of the page and keep the new ContentLength.
This technique does have the disadvantage of requiring two requests if the page has changed. It's also not 100% reliable on all servers. Some will return a different ContentLength for the HEAD request, and some don't return a valid ContentLength at all. That said, I've found it to be effective for a large number of sites.
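For reference, a sketch of the single-request conditional GET described above (the method name and error handling are illustrative, not canonical):
// Returns the new content, or null if the server answered 304 Not Modified.
// Pass in the LastModified value saved from the previous response.
string GetIfModified(string url, ref DateTime lastModified)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.IfModifiedSince = lastModified;
    try
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            lastModified = response.LastModified; // use the server's clock, not ours
            return reader.ReadToEnd();
        }
    }
    catch (WebException ex)
    {
        var response = ex.Response as HttpWebResponse;
        if (response != null && response.StatusCode == HttpStatusCode.NotModified)
            return null; // unchanged since lastModified
        throw;
    }
}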

Determine Final Destination of a Shortened URL

I'm trying to find the best way (in code) to determine the final destination of a shortened URL. For instance, http://tinyurl.com redirects to an eBay auction, and I'm trying to get the URL of that auction. I'm doing this from within .NET so I can compare multiple URLs and ensure that there are no duplicates.
TIA
I spent a minute writing code to ensure that it worked, by which time the answer had already been delivered, but I'll post the code anyway:
private static string GetRealUrl(string url)
{
    WebRequest request = WebRequest.Create(url);
    request.Method = WebRequestMethods.Http.Head;
    using (WebResponse response = request.GetResponse())
    {
        return response.ResponseUri.ToString();
    }
}
This will work as long as the short url service does a regular redirect.
You should issue a HEAD request to the URL using an HttpWebRequest instance, then check the ResponseUri in the returned HttpWebResponse.
Just make sure AllowAutoRedirect is set to true on the HttpWebRequest instance (it is true by default).
One way would be to request the URL and inspect the result code. If it's a 301 (permanent redirect), follow where the Location header takes you. Continue doing this until you reach a 200 (OK). With TinyURL you may pass through several 301s before reaching a 200.
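A sketch of that manual approach (the hop limit is an arbitrary guard against redirect loops):
string ResolveRedirects(string url)
{
    for (int hops = 0; hops < 10; hops++)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        request.AllowAutoRedirect = false; // inspect each redirect ourselves
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            int code = (int)response.StatusCode;
            if (code < 300 || code >= 400)
                return url; // 200 OK (or anything that isn't a redirect)
            // Resolve the Location header, which may be relative
            url = new Uri(new Uri(url), response.Headers["Location"]).ToString();
        }
    }
    throw new InvalidOperationException("Too many redirects");
}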
Assuming you don't want to actually follow the link, for TinyURL you can append /info to the end of the url:
http://tinyurl.com/unicycles/info
and it gives you a page showing where that tinyurl links to, which I assume would be easy to parse using xpath or similar.
Most other URL shortening services have similar features, but they all work differently.
