I often go to a site to look stuff up. I thought to myself: "Hold on. I can program. Why am I going to this site manually when I can write a piece of software that does it for me?".
And so I started. I'm using C#, so I found WebClient and Uri.
I've managed to get the source code for the site, but the problem is that the specific data I'm looking for is generated via AJAX, after the initial source has loaded.
So that's my problem. How can I get that code, if it needs to be requested via an AJAX call first?
The general approach is this:
1. Using a tool like Fiddler, find out which HTTP requests the browser makes in order to fetch the data you're looking for.
2. Use WebClient to make the same HTTP request(s) yourself.
Take a look at my answer to this question for more details about HTML screen scraping and how to work around various issues you may run across.
For #1 above, here's how to use Fiddler to understand how a specific request is being made:
First, find the request you care about (the request whose response contains the data you want). You can do this by inspecting each request: double-click it in the left pane in Fiddler and look at the "Text View" tab in the lower-right pane. You can also use CTRL+F to find content across multiple requests, but some requests are compressed, so you'll want to make sure the "autodecode" button is selected in the toolbar before making your requests if you want to be able to text-search across all of them.
Once you've found the request you want, double-click it in Fiddler and select the "Headers" tab in the upper-right pane. Those are the headers being sent. If your client sends exactly these headers to the server, you should get back the same data. But usually not all the headers are needed, so you'll want to figure out which ones are. You do this using Fiddler's Request Builder tab in the upper-right pane. Select that tab and drag your data request over from the left pane onto the Request Builder, then submit the request to validate that it returns the correct results. Then start deleting headers, one at a time; when the request stops working, you know the header you just removed was required. Repeat until you've identified all the required headers.
Then you'll need to write code to generate the right headers. Don't worry about the Host: header; that's generated automatically for you. For the Cookie: header, you'll need to generate it using the CookieContainer class. For the other headers (e.g. User-Agent:, Accept:, etc.), you can generally copy them and add them to your request as-is.
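If you end up needing cookies or non-default headers, HttpWebRequest can be more convenient than WebClient because it exposes a CookieContainer and the common header properties directly. Here's a rough sketch of step #2 (the URL and header values are placeholders, not taken from any particular site):

using System;
using System.IO;
using System.Net;

class AjaxFetchExample
{
    static void Main()
    {
        // Shared CookieContainer: any Set-Cookie received on an earlier request
        // is sent back automatically on later requests.
        var cookies = new CookieContainer();

        // Hypothetical endpoint discovered in Fiddler.
        var request = (HttpWebRequest)WebRequest.Create(
            "http://example.com/data/ajax-endpoint");
        request.Method = "GET";
        request.CookieContainer = cookies;
        request.UserAgent = "Mozilla/5.0";               // copy from Fiddler
        request.Accept = "*/*";
        // Headers without a dedicated property can be set like this:
        request.Headers["X-Requested-With"] = "XMLHttpRequest";

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}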
Related
As far as I understand, the GET method asks the server to send something to the client's browser. I set up an HttpListener in C#, and when I access http://localhost:1330/form.html the request I get from the client is: GET /form.html, which means that the client is saying "Hey server, I need the HTML code to display that page in the browser", which makes sense.
If I set a <form> with method=POST in form.html, the input fields' values are located in the request body, which is in context.Request.InputStream in C# and looks similar to this: input_name1=value&input_name2=value2&input_name3=value3... and the URL remains /form.html.
This also makes sense. The client says: "Hey server, take this data that was written in the HTML <input> elements" and the server uses it, maybe storing it in a database or computing something and sending it back to the client.
Now if I set the form method to GET, the URL is modified to /form.html?input_name1=value&input_name2=value2&input_name3=value3 and context.Request.InputStream remains blank, which is the opposite of the POST, where the InputStream contained the data and the URL had no queries. For me, the GET method in forms doesn't make any sense. Why do we need to take the data from the form on the client side, send it to the server, and then get it back at the client unmodified? Why do I send the data from the browser to C# and then send it back to the browser, if I can just get it on the client side using simple JavaScript?
At the moment the browser makes the GET request with the queries to the server, the client browser already has that data, so why does it ask the server for it if it is already in the client's browser?
Generally speaking, an HTTP GET method is used to receive data from the server, while an HTTP POST is used to modify data or add data to a resource.
For example, think about a search form. There may be some fields on the form used to filter the results, such as SearchTerm, Start/EndDate, Category, Location, IsActive, etc, etc. You're requesting the results from the server, but not modifying any of the data. Those fields will be added to the GET request by the client so the server can filter and return the results you requested.
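To make that concrete with the questioner's HttpListener setup, here's a minimal sketch (the port and field names are just examples) that reads the same form field from the query string for a GET and from the request body for a POST:

using System;
using System.IO;
using System.Net;
using System.Web;   // for HttpUtility (add a reference to System.Web)

class FormListener
{
    static void Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:1330/");
        listener.Start();

        while (true)
        {
            HttpListenerContext context = listener.GetContext();
            HttpListenerRequest request = context.Request;

            string searchTerm;
            if (request.HttpMethod == "GET")
            {
                // GET: the data travels in the URL, e.g. /form.html?SearchTerm=foo
                searchTerm = request.QueryString["SearchTerm"];
            }
            else
            {
                // POST: the data travels in the body, e.g. SearchTerm=foo&Category=bar
                using (var reader = new StreamReader(request.InputStream,
                                                     request.ContentEncoding))
                {
                    string body = reader.ReadToEnd();
                    searchTerm = HttpUtility.ParseQueryString(body)["SearchTerm"];
                }
            }

            // ... filter the results using searchTerm and write a response ...
            context.Response.Close();
        }
    }
}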
From the MDN article Sending form data:
Each time you want to reach a resource on the Web, the browser sends a request to a URL. An HTTP request consists of two parts: a header that contains a set of global metadata about the browser's capabilities, and a body that can contain information necessary for the server to process the specific request.
GET requests do not have a request body, so the parameters are added to the URL (this is defined in the HTTP spec, if you're interested).
The GET method is the method used by the browser to ask the server to send back a given resource: "Hey server, I want to get this resource." In this case, the browser sends an empty body. Because the body is empty, if a form is sent using this method the data sent to the server is appended to the URL.
An HTTP POST method uses the request body to add the parameters. Typically in a POST you will be adding a resource, or modifying an existing resource.
The POST method is a little different. It's the method the browser uses to talk to the server when asking for a response that takes into account the data provided in the body of the HTTP request: "Hey server, take a look at this data and send me back an appropriate result." If a form is sent using this method, the data is appended to the body of the HTTP request.
There are plenty of resources online to learn about the HTTP protocol and HTTP verbs/methods. The MDN articles An overview of HTTP, Sending form data, and HTTP request methods should provide some good introductory reading material.
I've got an .ashx handler which, upon finishing processing, will redirect to a success or error page, based on how the processing went. The handler is in my site, but the success or error pages might not be (this is something the user can configure).
Is there any way that I can pass the error details to the error page without putting it in the query string?
I've tried:
Adding a custom header that contains the error details, but since I'm using a Response.Redirect, the headers get cleared
Using Server.Transfer, instead of Response.Redirect, but this will not work for URLs not in my site
I know that I can pass data in the query string, but in some cases the data I need to pass might be too long for the query string. Do I have any other options?
Essentially, no. The only way to pass additional data in a GET request (i.e. a redirect) is to pass it in the query string.
The important thing to realise is that this is not a limitation of WebForms, this is just how HTTP works. If you're redirecting to another page that's outside of your site (and thus don't have the option of cookies/session data), you're going to have to send information directly in the request and that means using a query string.
Things like Server.Transfer and Response.Redirect are just abstractions over a simple HTTP request; no framework feature can defy how HTTP actually works.
You do, of course, have all kinds of options as to what you pass in the query string, but you're going to have to pass something. If you really want to shorten the URL, maybe you can pass an error code and expose an API that will let the receiving page fetch further information:
Store transaction information (or detailed error messages) in a database with an ID.
Pass the ID in the query string.
Expose a web method or similar API to allow the receiving page to request additional information.
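A rough sketch of that approach (the handler, helper, and URL names here are illustrative, not from the question):

using System;
using System.Web;

public class ProcessingHandler : IHttpHandler
{
    public bool IsReusable { get { return false; } }

    public void ProcessRequest(HttpContext context)
    {
        // These would be the user-configured, possibly external, pages.
        string successUrl = "http://other-site.example/success";
        string errorUrl = "http://other-site.example/error";

        try
        {
            DoProcessing();
            context.Response.Redirect(successUrl);
        }
        catch (Exception ex)
        {
            // Persist the full details under a short ID (e.g. a GUID).
            Guid errorId = Guid.NewGuid();
            SaveErrorDetails(errorId, ex.ToString());   // e.g. INSERT into an Errors table

            // Only the ID crosses the redirect, so the query string stays short.
            // The external error page can then call an API you expose, e.g.
            // GET /errors.ashx?id={errorId}, to fetch the stored details.
            context.Response.Redirect(errorUrl + "?errorId=" + errorId);
        }
    }

    private void DoProcessing() { /* the actual work */ }
    private void SaveErrorDetails(Guid id, string details) { /* write to a database */ }
}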
There are plenty of hacky ways you could create the illusion of passing data in a redirect outside of a form post (such as returning a page containing a form and Javascript to immediately do a cross-domain form post) but the query string is the proper way of passing data in a GET request, so why try to hack around it?
If you must perform a redirect, you will need to pass some kind of information in the Query String, because that's how browser redirects work. You can be creative about how you pass it, though.
You could pass an error code, and have the consuming system know what various error codes mean.
You could pass a token, and have the consuming system know how to ask your system about the error information for the given token behind-the-scenes.
Also, if you have any flexibility around whether it's actually performing a redirect, you could use an AJAX request in the first place, and send back some kind of JSON object that the browser's javascript could interpret and send via a POST parameter or something like that.
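As a rough sketch of that last idea (all names are illustrative), the handler could return a JSON payload instead of redirecting, and leave it to client-side script to decide what to do next:

using System.Web;
using System.Web.Script.Serialization;   // System.Web.Extensions

public class AjaxProcessingHandler : IHttpHandler
{
    public bool IsReusable { get { return false; } }

    public void ProcessRequest(HttpContext context)
    {
        // Result of the processing; the details can be as long as needed
        // because nothing here has to fit in a query string.
        var result = new
        {
            success = false,
            errorDetails = "Detailed error text that would be too long for a URL"
        };

        context.Response.ContentType = "application/json";
        context.Response.Write(new JavaScriptSerializer().Serialize(result));
    }
}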
A redirect is executed by most browsers as a GET, which means you'd have to put the data in the query string.
One trick (posted in two other answers) to do a "redirect" as a POST is to turn the response into a form that POSTs itself to the target site:
// Requires: using System.Text; postbackUrl and id are assumed to be defined.
Response.Clear();
StringBuilder sb = new StringBuilder();
sb.Append("<html>");
sb.Append(@"<body onload='document.forms[""form""].submit()'>");
sb.AppendFormat("<form name='form' action='{0}' method='post'>", postbackUrl);
// POST values go here, one hidden input per value:
sb.AppendFormat("<input type='hidden' name='id' value='{0}'>", id);
sb.Append("</form>");
sb.Append("</body>");
sb.Append("</html>");
Response.Write(sb.ToString());
Response.End();
But I would read the comments on both to understand the limitations.
Basically there are two usual HTTP ways to send some data - GET and POST.
When you redirect to another URL with additional parameters, you make the client browser send a GET request to the target server. Technically, your server responds to the browser with a redirect status code (such as 302) plus the URL to go to (including the GET parameters).
Alternatively, you may want/need to make a POST request to the target URL. In that case you should respond with a simple HTML form consisting of several hidden fields pre-filled with the values to send. The form's action should point to the target URL, the method should be "POST", and the HTML should include JavaScript that automatically submits the form once the document is loaded. This way the client browser sends a POST request instead of a GET.
I'm trying to get the raw data sent to IIS using an HttpHandler. However, because the request is a "GET" request without the "Content-Length" header set, IIS reports that there is no data to read (TotalBytes), and the input stream is empty. Is there any way I can plug into the IIS pipeline (maybe even before the request is parsed) and just take control over the request and read its raw data? I don't care if I need to parse headers and the like myself; I just want to get my hands on the actual request and tell IIS to ignore it. Is that at all possible? Right now it looks like the alternative is developing a custom standalone server, and I really don't want to do that.
Most web servers will ignore (and rarely give you access to) the body of a GET request, because the HTTP semantics imply that it is to be ignored anyway. You should consider another method (for example POST or PUT).
See this question and the link in this answer:
HTTP GET with request body
I have a Flex application that needs to filter users depending on their database groups. Depending on which group they are in, a config.xml file is used to populate the swf.
Here is how I figured I would do this:
1. The client comes to a .aspx page with a form requiring a username and a password.
2. On the server side I confirm the user credentials.
3. Once the username/password is valid, I redirect to the mxml file with the config.xml file in the HTTP POST data.
My problem comes when I need to get the post data from the http request. Let's say I have this code :
<mx:Application initialize="init()">
    <mx:Script>
        <![CDATA[
            private function init():void
            {
                // get the post data here
            }

            /* More code here */
        ]]>
    </mx:Script>
</mx:Application>
How do I get the POST data in the init() function?
Thank you.
For those who would be interested, I've found some resources in the Adobe Flex 3 Resource Center.
Basically there is no current way to pass data with the POST method. You can either add the parameters at the end of your swf URL (GET method) as shown here: http://livedocs.adobe.com/flex/3/html/help.html?content=deep_linking_5.html#245869
The other way is to embed them in the page with the FlashVars method shown here: http://livedocs.adobe.com/flex/3/html/help.html?content=passingarguments_3.html#229997
If you still wonder how I managed to do this, or if you run into the same situation, here is my idea (feel free to share if you have a different vision):
1. User logs in via login.aspx
2. Depending on the user's credentials, the server-side code modifies the index.html file to embed the correct xml file in the flash object.
3. With the FlashVars method, I get back the xml file path, and the job is done!
If you ever run into a similar situation and need help, contact me.
I don't think it's possible to get the POST data, but others might have a way. An alternative solution would be:
User logs in: login.aspx
User directed to Flash content: content.html embedding content.swf
Flash requests config.xml from server: content.swf makes HTTP request for config.xml.aspx
Server provides user's configuration in config.xml.aspx
In your init() function, you'd make the URLLoader request to get the configuration, and you'd do the configuration in the Event.COMPLETE handler.
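Server-side, config.xml.aspx (the name from the steps above) could be as simple as looking up the logged-in user's group and streaming back the matching file. A rough C# sketch, where the session key and file layout are purely illustrative (the same idea works whether it's an .aspx page or a handler):

using System.Web;
using System.Web.SessionState;

public class ConfigXmlHandler : IHttpHandler, IRequiresSessionState
{
    public bool IsReusable { get { return false; } }

    public void ProcessRequest(HttpContext context)
    {
        // login.aspx is assumed to have stored the user's group in session.
        string group = context.Session["userGroup"] as string ?? "default";

        // Serve the group-specific configuration file as XML.
        context.Response.ContentType = "text/xml";
        string path = context.Server.MapPath("~/config/" + group + "-config.xml");
        context.Response.WriteFile(path);
    }
}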
Another possibility is to use HTTP cookies--not handled natively by Flash, but you can get to them via Javascript--see this CookieUtil class.
It happens that when I save a web-page source from IE, it differs from the source downloaded by HttpWebRequest in my C# app.
I have saved both files for reference. The one saved from IE is here and the one from HttpWebRequest is here.
They differ in formatting and in the content itself. It seems that the one downloaded by HttpWebRequest is broken and doesn't contain valid data (whereas the one saved from IE is perfect).
I don't know why I cannot get the same nicely formatted source that IE saves.
Regards,
Mariusz
I suspect the one downloaded using IE has got some state associated with it from either cookies or session variables that were set when you visited the site manually. The one downloaded using C# will have the default values for everything, and hence different content.
This looks most likely because the file_web file contains a section called "LastViewedHotels" that contains an entry for the Arora Manchester.
Additionally, it looks like there is dynamic content for displaying adverts, which is different between the two files.
Usually this happens when the site you are navigating to loads additional content via Ajax or frames.
To overcome this and always fetch the content IE sees, you can use the WebBrowser control to navigate and take the source from there.
Here is an example.
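A minimal sketch of that approach (the URL is a placeholder, and a real scraper may need to wait for AJAX content to finish loading before reading the document):

using System;
using System.Windows.Forms;

class WebBrowserScraper
{
    [STAThread]
    static void Main()
    {
        // The WebBrowser control runs scripts/AJAX just like IE does.
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (sender, e) =>
        {
            // Read the DOM as the browser currently sees it.
            Console.WriteLine(browser.Document.GetElementsByTagName("html")[0].OuterHtml);
            Application.ExitThread();
        };
        browser.Navigate("http://example.com/page-with-dynamic-content");
        Application.Run();   // the control needs a Windows Forms message loop
    }
}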
Update
From running a KDiff on the sources you gave, it looks like there's 1 major line difference:
<link rel="alternate" type="text/html" hreflang="de"...
And that looks like it has an ID generated from a session (a cookie) so there's not much you can do about that without copying the IE cookie header.
Previous answer
"Under the hood", IE and HttpWebRequest both perform the same simple task, which is to send the following text request on port 80 via a a socket to the HTTP server:
GET / HTTP/1.1
(or 1.0 - and a host header too).
If you're on Windows you can try it out. Install the built-in Windows telnet client (Add/Remove Programs -> Windows Features), or PuTTY, and then type:
GET / HTTP/1.1 (newline)
Host: yahoo.com
The source from this, IE, and the HttpWebRequest class will be exactly the same. The only difference will come if IE is passing cookies to the server, plus any extra headers, which normally include:
A user agent
Accept */*
An Accept-Encoding: gzip header
Cookies or session variables (session variables being cookies that expire when IE is closed)
For formatting, IE might turn tabs into spaces, or the other way around. The HttpWebRequest will return the raw results without any formatting.