Saving page source using a web browser vs. the HttpWebRequest class from C#

When I save a web page's source from IE, it differs from the source downloaded by HttpWebRequest in my C# app.
I have saved both files for reference. The one saved from IE is here and the one from HttpWebRequest is here.
They differ in formatting and in the content itself. It seems that the one downloaded by HttpWebRequest is broken and doesn't contain valid data (whereas the one saved from IE is perfect).
I don't know why I cannot get the same nicely formatted source that IE produces.
Regards
Mariusz

I suspect the one downloaded using IE has got some state associated with it from either cookies or session variables that were set when you visited the site manually. The one downloaded using C# will have the default values for everything, and hence different content.
This looks most likely because the file_web file contains a section called "LastViewedHotels" that contains an entry for the Arora Manchester.
Additionally, it looks like there is dynamic content for displaying adverts, which is different between the two files.

Usually this happens when the site you are navigating to loads additional content via Ajax or frames.
To overcome this and always fetch the content IE sees, you can use the WebBrowser control to navigate and take the source from there.
Here is an example.

Update
From running a KDiff on the sources you gave, it looks like there's 1 major line difference:
<link rel="alternate" type="text/html" hreflang="de"...
And that looks like it has an ID generated from a session (a cookie) so there's not much you can do about that without copying the IE cookie header.
Previous answer
"Under the hood", IE and HttpWebRequest both perform the same simple task, which is to send the following text request on port 80 via a socket to the HTTP server:
GET / HTTP/1.1
(or 1.0 - and a host header too).
If you're on Windows you can try it out. Install the built in Windows telnet client (add/remove programs->windows features), or putty and then type:
GET / HTTP/1.1 (newline)
Host: yahoo.com
The source from this, IE, and the HttpWebRequest class will be exactly the same. The only difference will come if IE is passing cookies to the server, or from any extra headers, which normally include:
A user agent (User-Agent)
Accept: */*
Gzip (Accept-Encoding: gzip)
Cookies (including session cookies, which expire when IE is closed)
For formatting, IE might turn tabs into spaces, or the other way around. The HttpWebRequest will return the raw results without any formatting.
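As a rough sketch of the point above, the request below sets the headers a browser usually adds, so that the server has less reason to answer differently. The class name, method names, and the User-Agent string are illustrative assumptions; the real values IE sends on your machine should be copied from a capture of its traffic.

```csharp
using System;
using System.IO;
using System.Net;

static class BrowserLikeDownload
{
    // Builds a GET request carrying the headers a browser typically adds.
    // The User-Agent value is only an example, not what your IE actually sends.
    public static HttpWebRequest CreateRequest(string url, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";
        request.Accept = "*/*";
        // Advertises gzip/deflate support and decompresses the response transparently.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        // Cookies set by responses are stored here and sent on later requests
        // that share the same container.
        request.CookieContainer = cookies;
        return request;
    }

    // Sends the request and returns the response body as text.
    public static string Download(string url, CookieContainer cookies)
    {
        var request = CreateRequest(url, cookies);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Reusing one CookieContainer across several calls to Download is what lets a session cookie from the first response travel with the later requests.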

Related

How can I download only part of a page?

I have 100 pages on my site, but I want to download only part of each page instead of the whole page content.
I want to download just one box from each page; its size is 10 KB.
For this I use WebClient and HtmlAgilityPack.
WebClient Client = new WebClient();
var result = Encoding.GetEncoding("UTF-8").GetString(Client.DownloadData(URL));
Unfortunately, that's not possible, because HTTP is not designed to deliver a specific part of a web page. It does support range requests, but for that you would need to know where exactly (in terms of bytes) the desired content is located.
You can download the whole page and then use an HTML parsing library to extract the part you need.
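A minimal sketch of that second step, using the HtmlAgilityPack library mentioned in the question. It assumes the box is a div with a known id attribute; the class name, method name, and the id are hypothetical, and the XPath should be adapted to the real markup.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, named in the question

static class BoxExtractor
{
    // Returns the inner HTML of the div with the given id, or null if it is not found.
    // Assumes boxId contains no single quote, since it is placed inside the XPath.
    public static string ExtractBox(string html, string boxId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@id='" + boxId + "']");
        return node == null ? null : node.InnerHtml;
    }
}
```

The html argument would be the string already downloaded with WebClient. The whole page still travels over the network; only the extraction happens locally.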
You cannot achieve this.
The only solution is to change the website structure itself, if you have control of the server:
Change the architecture of your website, making the data in the box accessible via an Ajax call.
Now you can get the data via the WebClient.
If that data is already served via an API call, you can point your WebClient to that URI instead.
Here is an example of structuring your website based on Ajax:
AJAX with jQuery and ASP.NET

Use HTML string from a server request, and create the web page without saving it to a file [in C#]

I'm sending the value of a variable via POST to a PHP page from C#. I get back the data stream from the server, which contains the whole web page in HTML with the value of the POST. This information is stored in a string variable.
I would like to open a browser and show the web page (maybe using System.Diagnostics.Process.Start("URL")) without having to save it in a file; that is, show the page on the fly so that, when the browser is closed, no file is left stored.
Any idea?
Drop a WebBrowser control (webBrowser1) onto a new form and set its DocumentText property to your result HTML:
webBrowser1.DocumentText = ("<html><body>hello world</body></html>");
source:
<html><body>hello world</body></html>
You aren't going to be able to do that in an agnostic way.
If you simply wanted to open the URL in a browser, then using the Process class would work.
Unfortunately, in your case, you already have the content from creating the POST to the server, and you really want to stream that response in your application to the browser.
It's possible with some browsers, but it can't be done in a browser-agnostic way (and it's complicated even when targeting a specific browser).
To complicate matters, you want the browser to believe that the stream you are sending it is really coming from the server, when in reality, it's not.
I believe that your best bet would be to save the response to the file system in a temp file. However, before you do, add the <base> tag to the file with the URL that the file came from. This way, relative URLs will resolve correctly when rendered in the browser.
Then, just open the temporary file in the browser using the Process class.
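The temp-file approach described above might look roughly like this. The class and method names are made up for illustration, and placing the <base> tag right after the opening <head> tag (or at the very start when there is no <head>) is an assumption about where it is safe to put it.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

static class TempPageViewer
{
    // Inserts <base href="..."> right after the opening <head> tag so that
    // relative URLs resolve against the original page address.
    public static string InsertBaseTag(string html, string sourceUrl)
    {
        string baseTag = "<base href=\"" + WebUtility.HtmlEncode(sourceUrl) + "\">";
        // Matches <head> or <head ...attributes>, but not <header>.
        var headOpen = new Regex(@"<head(\s[^>]*)?>", RegexOptions.IgnoreCase);
        if (headOpen.IsMatch(html))
        {
            // Replace only the first match, keeping the original tag text.
            return headOpen.Replace(html, m => m.Value + baseTag, 1);
        }
        // Fallback for documents without a <head>: put the tag first.
        return baseTag + html;
    }

    // Writes the page to a temporary .html file and opens it in the default browser.
    public static void ShowInDefaultBrowser(string html, string sourceUrl)
    {
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".html");
        File.WriteAllText(path, InsertBaseTag(html, sourceUrl));
        Process.Start(new ProcessStartInfo(path) { UseShellExecute = true });
    }
}
```

The temporary file is not deleted here; cleaning it up after the browser has read it is left to the caller.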

Get html that is generated via AJAX in webclient

I often go to a site to look stuff up. I thought to myself: "Hold on. I can program. Why am I going to this site manually when I can write a piece of software that does it for me?".
And so I started. I'm using C#, so I found WebClient and Uri.
I've managed to get the source code for the site, yet the problem occurred that the specific data I'm looking for is generated via AJAX, after the source code has loaded.
So that's my problem. How can I get that code, if it needs to be requested via an AJAX call first?
The general approach is this:
1. Using a tool like Fiddler, find out which HTTP requests are made by the browser in order to fetch the data you're looking for.
2. Use WebClient to make the HTTP request(s) you need.
Take a look at my answer to this question for more details about HTML screen scraping and how to work around various issues you may run across.
For #1 above, here's how to use Fiddler to understand how a specific request is being made:
First, find the request you care about (the request which contains the data you want in its response). You can do this by inspecting each request: double-click it in the left pane in Fiddler and look inside the "TextView" tab in the lower-right pane. You can also use CTRL+F to find content across multiple requests, but some responses are compressed, so make sure the "autodecode" button is selected in the toolbar before making your requests if you want to be able to text-search across all of them.
Once you've found the request you want, double-click it in Fiddler and select the "headers" tab in the upper-right pane. Those are the headers being sent. If your client sends exactly these headers to the server, you should get back the same data. Usually not all of the headers are needed, though, so you'll want to figure out which ones are. You do this using Fiddler's Request Builder tab in the upper-right pane. Select that tab and drag your data request over from the left pane onto the request builder. Then submit the request to validate that it returns the correct results. Then start deleting headers, one at a time, resubmitting after each deletion. If the request stops working, you know that header was required, so put it back. Repeat for each header until you know which ones are required.
Then, you'll need to write code to generate the right headers. Don't worry about the Host: header; that's generated automatically for you. For the Cookie: header, you'll need to generate it using the CookieContainer class. For the other headers (e.g. User-Agent:, Accept:, etc.) you can generally copy them and add them to your request as-is.
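A sketch of what that last step could look like, using HttpWebRequest because it exposes CookieContainer directly. The class and method names and the particular header values are assumptions for illustration; X-Requested-With is the header jQuery adds to Ajax calls, and whether the site needs it is exactly what the header-deletion experiment in Fiddler tells you.

```csharp
using System;
using System.IO;
using System.Net;

static class AjaxScraper
{
    // Builds the request for the Ajax endpoint found with Fiddler.
    public static HttpWebRequest CreateAjaxRequest(string ajaxUrl, string referer, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(ajaxUrl);
        request.CookieContainer = cookies;              // produces the Cookie: header
        request.Accept = "application/json, text/javascript, */*";
        request.Referer = referer;                      // restricted header: must be set via the property
        request.Headers["X-Requested-With"] = "XMLHttpRequest";
        return request;
    }

    // Sends a request and returns the response body as text.
    static string ReadBody(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Loads the normal page first so the server can set its cookies,
    // then calls the Ajax endpoint with the same cookie container.
    public static string FetchAjaxData(string pageUrl, string ajaxUrl)
    {
        var cookies = new CookieContainer();
        var pageRequest = (HttpWebRequest)WebRequest.Create(pageUrl);
        pageRequest.CookieContainer = cookies;
        ReadBody(pageRequest);
        return ReadBody(CreateAjaxRequest(ajaxUrl, pageUrl, cookies));
    }
}
```

Both URLs passed to FetchAjaxData come from your own Fiddler capture; nothing here discovers them for you.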

POST data to a Flex/Flash (mxml) application

I have a Flex application that needs to filter users depending on their database groups. Depending on which group they are in, there is a config.xml file that is used to populate the SWF.
Here is how I figured I would do this:
1. The client comes to an .aspx page with a form requiring a username and a password.
2. On the server side I confirm the user's credentials.
3. Once the username/password is valid, I redirect to the mxml file with the config.xml file passed in the request (POST).
My problem comes when I need to get the POST data from the HTTP request. Let's say I have this code:
<mx:Application initialize="init()">
<mx:Script>
<![CDATA[
private function init():void
{
// get the post data here
}
/* More code here */
]]>
</mx:Script>
</mx:Application>
How do I get the POST data in the init() function?
Thank you.
For those who are interested, I've found some resources in the Adobe Flex 3 Resource Center.
Basically, there is currently no way to pass data with the POST method. You can either add the parameters at the end of your SWF URL (GET method) as shown here: http://livedocs.adobe.com/flex/3/html/help.html?content=deep_linking_5.html#245869
The other way is to embed them in the page with the flashVars method shown here: http://livedocs.adobe.com/flex/3/html/help.html?content=passingarguments_3.html#229997
In case you run into the same situation and are wondering how I'll manage this, here is my idea (feel free to share if you have a different vision):
1. The user logs in at login.aspx.
2. Depending on the user's credentials, the server-side code modifies the index.html file to embed the correct XML file in the Flash object.
3. With the FlashVars method, I get back the XML file path, and the job is done!
If you ever run into a similar situation and need help, contact me.
I don't think it's possible to get the POST data, but others might have a way. An alternative solution would be:
User logs in: login.aspx
User directed to Flash content: content.html embedding content.swf
Flash requests config.xml from server: content.swf makes HTTP request for config.xml.aspx
Server provides user's configuration in config.xml.aspx
In your init() function, you'd make the URLLoader request to get the configuration, and you'd do the configuration in the Event.COMPLETE handler.
Another possibility is to use HTTP cookies--not handled natively by Flash, but you can get to them via Javascript--see this CookieUtil class.

How to detect if page load in newly-started browser process fails?

I use Process.Start("firefox.exe", "http://localhost/page.aspx");
How can I know whether the page fails or not?
OR
How can I know via HttpWebRequest and HttpWebResponse whether the page fails or not?
When I use
HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("somepage.aspx");
HttpWebResponse loWebResponse = (HttpWebResponse)myReq.GetResponse();
Console.Write("{0},{1}",loWebResponse.StatusCode, loWebResponse.StatusDescription);
how can I return the error details?
I don't want additional plugins or frameworks. I want to solve this problem using only .NET.
Any ideas, please?
Use WatiN to automate Firefox instead of Process.Start. It's a browser automation framework that will let you monitor what is happening properly.
http://watin.sourceforge.net/
edit: see also Google Webdriver http://google-opensource.blogspot.com/2009/05/introducing-webdriver.html
If you are spawning a child-process, it is quite hard and you'd probably need to use each browser's specific API (it won't be the same between FF and IE, for example).
It doesn't help that in many cases the exe detects an existing instance and forwards the request there (so you can't trust the exit-code, since the page hasn't even been requested in the right exe yet).
Personally, I try to avoid assuming any particular browser for this scenario; just launch the url:
Process.Start("http://somesite.com");
This will use the user's default browser. You have to hope it appears though - you can't (reliably and robustly) check that externally without lots of work.
One other option is to read the data yourself (WebClient.Download*) - but this may have issues with complex cookies, login, user-agent awareness, etc.
Use the HttpWebRequest class or the WebClient class to check this. I don't think Process.Start will return anything useful if the URL does not exist.
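As a sketch of how the error details can be obtained with HttpWebRequest: GetResponse throws a WebException for 4xx and 5xx statuses, and the failed response is available on the exception. The class and method names are illustrative, and the URL passed in must be absolute (for example http://localhost/page.aspx).

```csharp
using System;
using System.IO;
using System.Net;

static class PageChecker
{
    // Requests the page and returns a short description of the outcome.
    public static string Check(string url)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return string.Format("OK: {0} {1}", (int)response.StatusCode, response.StatusDescription);
            }
        }
        catch (WebException ex)
        {
            // For HTTP-level errors the server's answer is attached to the exception.
            var errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse != null)
            {
                using (errorResponse)
                using (var reader = new StreamReader(errorResponse.GetResponseStream()))
                {
                    string body = reader.ReadToEnd(); // e.g. the ASP.NET error page
                    return string.Format("Failed: {0} {1}\n{2}",
                        (int)errorResponse.StatusCode, errorResponse.StatusDescription, body);
                }
            }
            // No HTTP response at all: DNS failure, refused connection, timeout, ...
            return "No HTTP response: " + ex.Status + " - " + ex.Message;
        }
    }
}
```

This only tells you how the server answers your own request; it says nothing about what happens inside a separately started browser process.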
Don't start the page in this form. Instead, create a local http://localhost:<port>/wrapper.html which loads http://localhost/page.aspx and then either http://localhost:<port>/pass.html or http://localhost:<port>/fail.html. http://localhost:<port> is a trivial HTTP server interface implemented by your app.
The idea is that Javascript gives you an API inside the browser, which is far more standard than the APIs on the outside of browsers. Since the Javascript on wrapper.html comes from the same server and even port as the subsequent resources, this should satisfy the same-origin policies in current browsers.
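The "trivial HTTP server" side of this idea could be sketched with HttpListener as below. The class and method names are made up; the wrapper page's HTML and JavaScript (which must load page.aspx and then navigate to pass.html or fail.html) are supplied by the caller and not shown, and there is no timeout, so the call blocks until one of the two result pages is requested.

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Text;

static class ResultListener
{
    // prefix looks like "http://localhost:8080/" (it must end with a slash).
    // Returns true if the browser requested /pass.html, false for /fail.html.
    public static bool LaunchAndWait(string prefix, string wrapperHtml)
    {
        using (var listener = new HttpListener())
        {
            listener.Prefixes.Add(prefix);
            listener.Start();

            // Open the wrapper page in the user's default browser
            // only after the listener is ready to answer.
            Process.Start(new ProcessStartInfo(prefix + "wrapper.html") { UseShellExecute = true });

            while (true)
            {
                HttpListenerContext context = listener.GetContext(); // blocks until a request arrives
                string path = context.Request.Url.AbsolutePath;

                // Serve the wrapper page; every other path gets an empty body.
                string body = path == "/wrapper.html" ? wrapperHtml : "";
                byte[] bytes = Encoding.UTF8.GetBytes(body);
                context.Response.ContentType = "text/html; charset=utf-8";
                context.Response.ContentLength64 = bytes.Length;
                context.Response.OutputStream.Write(bytes, 0, bytes.Length);
                context.Response.Close();

                if (path == "/pass.html") return true;
                if (path == "/fail.html") return false;
            }
        }
    }
}
```

On Windows, listening on a prefix may require administrator rights or a URL reservation, depending on the prefix and system configuration.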
