HttpWebRequest vs WebClient (special scenario) - C#

I know this question has been answered before in this thread, but I couldn't seem to find the details.
In my scenario, I am building a console application which will keep an eye on an HTML page's source for any changes. If any update/change occurs, I will perform further operations. Moreover, I'll also perform a request every second, or as soon as the previous request finishes.
I can't seem to figure out whether I should use HttpWebRequest or WebClient to download the HTML page source and perform the comparison. What do you think would be an ideal solution in my case? Speed and reliability both :)

I'd go with HttpWebRequest because it's less abstracted and lets you fiddle with the HTTP parameters quite a bit. It gives you the option to not download the entire page if the server reports that the file hasn't changed, for example.
If you add a parameter like IfModifiedSince to your request (it can be a HEAD or GET request), the server may return the response code 304 - NOT MODIFIED. Refer to the description of caching in HTTP for further explanation.
The point is to make sure that you only download the full page when it has actually been modified since the last time you fetched it. Most of the time it won't have changed (I suppose; I can't know for sure without knowing your domain), so you only need a lightweight response from the server which simply states "nothing changed here".
Update: code sample demonstrating the use of IfModifiedSince property:
bool IsResourceModified(string url, DateTime dateTime)
{
    try
    {
        var request = (HttpWebRequest)WebRequest.Create(new Uri(url));
        // Ask the server to respond only if the resource changed after dateTime.
        request.IfModifiedSince = dateTime;
        request.Method = "HEAD";
        var response = (HttpWebResponse)request.GetResponse();
        return true;
    }
    catch (WebException ex)
    {
        // GetResponse throws on 304 NOT MODIFIED, so unwrap the status and
        // rethrow anything that is a genuine connection problem.
        if (ex.Status != WebExceptionStatus.ProtocolError)
            throw;
        var response = (HttpWebResponse)ex.Response;
        if (response.StatusCode != HttpStatusCode.NotModified)
            throw;
        return false;
    }
}
This method should return true if the page was modified after the dateTime date and false if it wasn't. GetResponse will throw a WebException if you make a HEAD request and the server returns 304 - NOT MODIFIED (which is kind of unfortunate). We have to make sure that it's not some other web connection problem, which is why I check the status of the WebException and the HTTP status code in the response. If anything else caused the exception, we just rethrow it.
Console.WriteLine(IsResourceModified("http://example.com", new DateTime(2009, 1, 1)));
Console.WriteLine(IsResourceModified("http://example.com", DateTime.Now));
This sample code produces the output:
True
False
Note: make sure to read Jim Mischel's addition to this answer, as he gives a few good pieces of advice on this technique.

I was going to leave this as a comment to Dyppl's answer, but it became too long.
Dyppl's response is generally good advice, and the way that I would approach this problem. However, there are a few things you should keep in mind.
First, there's no reason to do a HEAD request followed by a GET if the page has been modified. You can do a GET with the IfModifiedSince header set, and the server will either return the entire page or a 304. Doing the HEAD first, followed by the GET, ends up making two requests to the server, which defeats much of the purpose of the conditional request.
Second, you should set the IfModifiedSince property to the LastModified value returned by the previous response (i.e. HttpWebResponse.LastModified) because the server's time might not be synchronized with your computer. Also, I've found that a large percentage of sites, particularly those with generated content (like WordPress blogs) lie. They always return the current date/time in the LastModified header. As a result, there is no benefit to doing the If-Modified-Since check on those sites.
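A minimal sketch of that single-request approach (the method name is mine, and it assumes using System.Net and System.IO): one GET with IfModifiedSince set from the previous response's LastModified, returning null when the server answers 304.
string DownloadIfModified(string url, ref DateTime lastModified)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.IfModifiedSince = lastModified;
    try
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            // Remember the server's own timestamp for the next check.
            lastModified = response.LastModified;
            return reader.ReadToEnd();
        }
    }
    catch (WebException ex)
    {
        var response = ex.Response as HttpWebResponse;
        if (response != null && response.StatusCode == HttpStatusCode.NotModified)
            return null; // 304: unchanged, the server sent no body
        throw;
    }
}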
If you know that the site lies and always returns the current date/time, you can keep track of the ContentLength header that's returned from the page when you download it. Then, when you want to check to see if the page has changed, do a HEAD request and check the returned ContentLength header with the saved value. If they match, then it's unlikely that the page has changed. If they don't match, then do a GET request to update your copy of the page and keep the new ContentLength.
This technique does have the disadvantage of requiring two requests if the page has changed. It's also not 100% reliable on all servers. Some will return a different ContentLength for the HEAD request, and some don't return a valid ContentLength at all. That said, I've found it to be effective for a large number of sites.
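A sketch of that Content-Length fallback (the method name and the handling of missing lengths are my assumptions, not Jim's exact code):
bool ProbablyUnchanged(string url, long savedContentLength)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "HEAD";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        // Some servers report -1 (unknown) for HEAD; treat that as "changed"
        // so the caller falls back to a full GET.
        if (response.ContentLength < 0)
            return false;
        return response.ContentLength == savedContentLength;
    }
}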

Related

How can I edit an HTTP request in C# using FiddlerCore

What I want to be able to do: Edit HTTP Requests before they are sent off to the server
User navigates to a webpage of their choice in their browser > they encounter a request they wish to edit > they edit the request, and that gets sent to the server instead of the original one.
What I have done so far: I have captured the request, now I need help finding the code to edit it. Here is my code for capturing the request so far:
Fiddler.FiddlerApplication.BeforeRequest += sess =>
{
    // Code to detect user-specified URL here
};
Is it possible for me to edit the request before it is actually sent? If it can be done using the FiddlerCore API only then I'd be grateful, although I am willing to download more binaries if required.
Additional notes: I have tried StreamWriters, BinaryWriters, and copying the response into a MemoryStream, editing it, and copying it back; none of those methods worked for me. Also, when I try some methods my app just hangs and doesn't respond to things like pressing the X.
Maybe I'm just bad at explaining what I'm trying to achieve; it seems the only good answer I've had has been about responses :/
If the request reads the string "hello world" then I'd like the user to be able to change the REQUEST to say "hello there"
Such a noobish mistake I made: I thought that RequestBody was read-only! Turns out I could have simply edited the request like this:
session.RequestBody = myBytes;
Really annoyed at myself for this!
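For completeness, a minimal sketch of doing that edit inside BeforeRequest (the URL check and strings are placeholders; it assumes FiddlerCore has already been started elsewhere):
Fiddler.FiddlerApplication.BeforeRequest += delegate(Fiddler.Session oS)
{
    // Placeholder check: only touch the URL(s) the user chose to edit.
    if (!oS.fullUrl.Contains("example.com"))
        return;

    // Read the outgoing body, rewrite it, and assign it back before it
    // leaves for the server; FiddlerCore should update Content-Length when
    // RequestBody is set (verify for your version).
    string body = oS.GetRequestBodyAsString();
    if (body.Contains("hello world"))
        oS.RequestBody = System.Text.Encoding.UTF8.GetBytes(
            body.Replace("hello world", "hello there"));
};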
In the demo app, adding the delegate is shown as:
Fiddler.FiddlerApplication.BeforeResponse += delegate(Fiddler.Session oS) {
// Console.WriteLine("{0}:HTTP {1} for {2}", oS.id, oS.responseCode, oS.fullUrl);
// Uncomment the following two statements to decompress/unchunk the
// HTTP response and subsequently modify any HTTP responses to replace
// instances of the word "Microsoft" with "Bayden". You MUST also
// set bBufferResponse = true inside the beforeREQUEST method above.
//
//oS.utilDecodeResponse(); oS.utilReplaceInResponse("Microsoft", "Bayden");
};

System.Net.HttpWebResponse Returning System.IO.Stream.NullStream

I have a case where HttpWebResponse.GetResponseStream() returns Stream.Null even though examination of the HttpWebResponse object reveals that its underlying m_ConnectStream is an instance of System.Net.ConnectStream and the ContentLength property exactly matches the length of the content returned from the server. I also poked around in the Watch window and found my data (though I can't remember where), so I KNOW my response data is there; the runtime just won't let me at it!
The only thing that is different from other successful scenarios is that the HttpWebRequest verb is "HEAD". I'm implementing a highly RESTful web service and wanting to use "HEAD" to request metadata for resources.
Figured it out:
Found the following .NET Framework source code (in the HttpWebResponse class):
/// <devdoc>
///    <para>Gets the stream used for reading the body of the response from the
///       server.</para>
/// </devdoc>
public override Stream GetResponseStream()
{
    if (Logging.On)
        Logging.Enter(Logging.Web, this, "GetResponseStream", "");
    CheckDisposed();
    if (!CanGetResponseStream())
    {
        // give a blank stream in the HEAD case, which = 0 bytes of data
        if (Logging.On)
            Logging.Exit(Logging.Web, this, "GetResponseStream", Stream.Null);
        return Stream.Null;
    }
    if (Logging.On)
        Logging.PrintInfo(Logging.Web, "ContentLength=" + m_ContentLength);
    if (Logging.On)
        Logging.Exit(Logging.Web, this, "GetResponseStream", m_ConnectStream);
    return m_ConnectStream;
}
As you can see it explicitly returns a null stream for "HEAD" requests. "Why would it do that?" I ask.
I found this at http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html:
9.4 HEAD
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
Wow. I took from the Richardson and Ruby RESTful Web Services book that you might be clever and respond to a "HEAD" request with a blank XHTML Form that would fully describe the structure of a resource's elements including requiredness, datatype, length etc. using all of the XHTML(5) form field attributes. After reading the HTTP spec, however, it is clear that all 'HEAD' response data has to go in the HTTP headers.
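In other words, a HEAD response's metadata has to be read from the headers; a minimal sketch (the URL is illustrative):
var request = (HttpWebRequest)WebRequest.Create("http://example.com/resource");
request.Method = "HEAD";
using (var response = (HttpWebResponse)request.GetResponse())
{
    // GetResponseStream() would return Stream.Null here; the metadata
    // lives in the headers instead.
    Console.WriteLine("Content-Type:   " + response.ContentType);
    Console.WriteLine("Content-Length: " + response.ContentLength);
    Console.WriteLine("Last-Modified:  " + response.LastModified);
}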
Oh well, you learn something new everyday ...

Check to see if a URL is a download link using WebClient in C#

I am reading from the history database, and for every URL read, I am downloading it and storing the data in a string. I want to be able to determine if the link is a download link, e.g. .exe or .zip. I am assuming I need to read the headers to determine this, but I don't know how to do it with WebClient. Any suggestions?
while (sqlite_datareader.Read())
{
    noIndex = false;
    string url = (string)sqlite_datareader["url"];
    try
    {
        if (url.Contains("http") && !url.Contains(".pdf") && !url.Contains(".jpg") && !url.Contains("https") && !isInBlackList(url))
        {
            WebClient client = new WebClient();
            client.Headers.Add("user-agent", "Only a test!");
            string htmlCode = client.DownloadString(url);
        }
    }
    catch (WebException)
    {
        // ignore URLs that fail to download and move on to the next one
    }
}
Instead of loading the complete content behind the link, I would issue a HEAD request.
The HEAD method is identical to GET except that the server MUST NOT return a message-body in the response. The metainformation contained in the HTTP headers in response to a HEAD request SHOULD be identical to the information sent in response to a GET request. This method can be used for obtaining metainformation about the entity implied by the request without transferring the entity-body itself. This method is often used for testing hypertext links for validity, accessibility, and recent modification.
Quoted from http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
See these questions for C# examples
How to check if a file exists on a server using c# and the WebClient class
How to check if System.Net.WebClient.DownloadData is downloading a binary file?
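A minimal sketch of the HEAD approach (WebClient can't set the HTTP verb directly, so this uses HttpWebRequest; treating everything other than text/html as a download is my assumption):
static bool LooksLikeDownload(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "HEAD";
    using (var response = (HttpWebResponse)request.GetResponse())
    {
        // Anything that isn't an HTML page is treated as a download link here.
        string contentType = response.ContentType ?? string.Empty;
        return !contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
    }
}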
You're on the right track; you'll need to examine the ResponseHeaders after a successful request:
var someType = "application/zip";
if (client.ResponseHeaders["Content-Type"].Contains(someType))
{
    // this was a "download link"
}
The tricky part will be in determining what constitutes a download link since there are so many content types possible. For example, how would you decide whether XML data is a download link or not?
Try checking WebClient's ResponseHeaders collection to validate the response file type.
In case anyone has the same problem: I have used an attribute in the history places.sqlite database which came in very handy!
places.sqlite contains a table called moz_historyvisits which has a column visit_type. According to [1], a visit_type of 7 is a download link. Therefore, reading this value will determine whether it is a download link without reading the response headers or even sending a HEAD request.
[1] http://www.firefoxforensics.com/research/moz_historyvisits.shtml
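For anyone taking this route, a sketch of the query (column names taken from the reference above; adjust to your schema):
string sql = @"SELECT p.url
               FROM moz_places p
               JOIN moz_historyvisits v ON v.place_id = p.id
               WHERE v.visit_type = 7"; // 7 = download per [1]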

HttpWebRequest.Address vs HttpWebResponse.ResponseUri

What's the difference between these two properties?
To put this into context: I am determining whether a redirect occurred by checking if our ResponseUri != RequestUri.
A redirect occurs either way, but the URL http://adage.com/adages/article?article_id=140560 yields a different ResponseUri (http://adage.com/adages/post.php) than the Address (http://adage.com/adages/post?article_id=140560).
It appears that ResponseUri takes the Content-Location header and uses it, while the Address maintains the correct location.
Would it be correct to compare the RequestUri to the HttpWebRequest.Address to check for redirects?
Yes, comparing request.RequestUri and request.Address is the way to go. At least in Mono, response.ResponseUri is the same as request.Address.
I know this is an old question, but I found it while researching this topic and noticed it wasn't actually answered correctly.
While HttpWebRequest.Address and HttpWebResponse.ResponseUri should always be the same, here is the difference:
HttpWebRequest.Address will return the Uri of the page actually responding.
HttpWebResponse.ResponseUri will return the value of the Content-Location header (if present). While the documentation doesn't explicitly state what happens if the Content-Location header is not present, it is assumed it will use the same value as Address.
Remember HTTP headers can be forged, so Microsoft recommends using Address instead of ResponseUri for security reasons.
http://msdn.microsoft.com/en-us/library/system.net.httpwebresponse.responseuri.aspx
Have you thought about setting request.AllowAutoRedirect = false and then reissuing the request on a redirect?
The Uri comparison should also work fine, although I am not sure of all the edge cases.
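Putting the accepted comparison into code, a minimal sketch (using the URL from the question):
var request = (HttpWebRequest)WebRequest.Create("http://adage.com/adages/article?article_id=140560");
using (var response = (HttpWebResponse)request.GetResponse())
{
    // The stack updates request.Address as redirects are followed.
    bool wasRedirected = request.RequestUri != request.Address;
    Console.WriteLine("Redirected: {0}, final address: {1}", wasRedirected, request.Address);
}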

Determine Final Destination of a Shortened URL

I'm trying to find the best way (in code) to determine the final destination of a shortened URL. For instance, a http://tinyurl.com link might redirect to an eBay auction. I'm trying to get the URL for the eBay auction. I'm trying to do this from within .NET so I can compare multiple URLs and ensure that there are no duplicates.
TIA
While I spent a minute writing the code to ensure that it worked, the answer was already delivered, but I'll post the code anyway:
private static string GetRealUrl(string url)
{
    WebRequest request = WebRequest.Create(url);
    request.Method = WebRequestMethods.Http.Head;
    using (WebResponse response = request.GetResponse())
    {
        return response.ResponseUri.ToString();
    }
}
This will work as long as the short url service does a regular redirect.
You should issue a HEAD request to the URL using an HttpWebRequest instance. In the returned HttpWebResponse, check the ResponseUri.
Just make sure AllowAutoRedirect is set to true on the HttpWebRequest instance (it is true by default).
One way would be to read the URL and get the result code from it. If it's a 301 (permanent redirect), then follow where it's taking you. Continue to do this until you reach a 200 (OK). With TinyURL you may go through several 301s before you reach a 200.
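A sketch of that loop (the hop cap is an assumption to guard against redirect cycles):
string FollowRedirects(string url)
{
    for (int hops = 0; hops < 10; hops++)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";
        request.AllowAutoRedirect = false;
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            int code = (int)response.StatusCode;
            if (code < 300 || code >= 400)
                return url; // not a redirect: this is the final destination
            // Location may be relative, so resolve it against the current URL.
            url = new Uri(new Uri(url), response.Headers["Location"]).ToString();
        }
    }
    throw new WebException("Too many redirects.");
}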
Assuming you don't want to actually follow the link, for TinyURL you can append /info to the end of the url:
http://tinyurl.com/unicycles/info
and it gives you a page showing where that TinyURL links to, which I assume would be easy to parse using XPath or similar.
Most other URL shortening services have similar features, but they all work differently.
