I am trying to load a processed webpage into a string, but it seems like it is loading the JavaScript as well; I want this to be "the final" result that can be saved to a static HTML file and run offline.
This is what I am doing at this moment
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(textBox9.Text);
IWebProxy theProxy = request.Proxy;
if (theProxy != null)
{
theProxy.Credentials = CredentialCache.DefaultCredentials;
}
request.UseDefaultCredentials = true;
request.Proxy = WebRequest.DefaultWebProxy;
// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// we will read data via the response stream into a string
Stream resStream = response.GetResponseStream();
StreamReader resReader = new StreamReader(resStream);
string html = resReader.ReadToEnd();
Any suggestions?
If I understand your post correctly, you don't want to strip the javascript out of the page, but keep it and make it so that it will execute just as though you had visited the page normally in a browser?
This is kind of a notoriously hard problem for proxies to overcome, and others have done it with varying degrees of success. Javascript that is embedded in the page should run just fine, but you will run into problems running any javascript that is loaded into a page from an external file.
One thing you could try is to rewrite the paths to external javascript libraries to reflect a local path, then grab copies of those javascript files over the network as well and store everything in a mimicked directory structure. Your mileage may vary based on how fancy the javascript involved is, e.g. some ajax calls probably won't work no matter what you do.
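If you go that route, something along these lines could be a starting point (a rough sketch only; the regex and file-naming scheme are simplifications, and dynamically loaded or inline-generated scripts won't be covered):
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class ScriptLocalizer
{
    // Downloads each external script referenced by the page and rewrites the
    // src attribute to point at the local copy.
    public static string LocalizeScripts(string html, string baseUrl, string saveDir)
    {
        Directory.CreateDirectory(saveDir);
        Regex scriptSrc = new Regex("<script[^>]*src=\"(?<src>[^\"]+)\"", RegexOptions.IgnoreCase);

        return scriptSrc.Replace(html, match =>
        {
            // resolve relative script URLs against the page's own address
            Uri scriptUri = new Uri(new Uri(baseUrl), match.Groups["src"].Value);
            string localName = Path.GetFileName(scriptUri.LocalPath);
            if (string.IsNullOrEmpty(localName)) localName = "script.js";

            using (WebClient client = new WebClient())
            {
                client.DownloadFile(scriptUri, Path.Combine(saveDir, localName));
            }

            // point the tag at the local copy instead of the remote file
            return match.Value.Replace(match.Groups["src"].Value, saveDir + "/" + localName);
        });
    }
}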
Related
How can I extract all contents of a website, not only a webpage? If we consider a website named www.abc.com, how can we get all of the contents from all of the pages of this site? I have tested some code, but it only gets the contents of a single page of a website using C#.
string urlAddress = "https://www.motionflix.xyz/";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
Stream receiveStream = response.GetResponseStream();
StreamReader readStream = null;
if (String.IsNullOrWhiteSpace(response.CharacterSet))
readStream = new StreamReader(receiveStream);
else
readStream = new StreamReader(receiveStream, Encoding.GetEncoding(response.CharacterSet));
string data = readStream.ReadToEnd();
Console.WriteLine(data);
response.Close();
readStream.Close();
}
Create a list containing all the URLs that have already been scraped
Create a loop that starts with a given URL: add it to the URL list, then scrape the content of that page and search it for href attributes (= new URLs). If a new URL is not in the list already, repeat step 2 with that URL. Go on as long as there are new URLs that have not been scraped yet.
Note that you may want to check whether a URL is still on the same domain, otherwise you might accidentally scan the whole internet. A rough sketch of this crawl loop is shown below.
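A minimal sketch of that loop might look like this (the downloadPage delegate and the regex-based link extraction are assumptions for illustration; the snippet from the question could serve as the download step):
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class SimpleCrawler
{
    static readonly Regex HrefPattern = new Regex("href=\"(?<url>[^\"]+)\"", RegexOptions.IgnoreCase);

    public static void Crawl(string startUrl, Func<string, string> downloadPage)
    {
        var visited = new HashSet<string>();   // URLs already scraped
        var pending = new Queue<string>();     // URLs still to scrape
        pending.Enqueue(startUrl);
        string domain = new Uri(startUrl).Host;

        while (pending.Count > 0)
        {
            string url = pending.Dequeue();
            if (!visited.Add(url)) continue;   // skip anything already scraped

            string html = downloadPage(url);
            foreach (Match m in HrefPattern.Matches(html))
            {
                Uri link;
                if (Uri.TryCreate(new Uri(url), m.Groups["url"].Value, out link)
                    && link.Host == domain     // stay on the same domain
                    && !visited.Contains(link.AbsoluteUri))
                {
                    pending.Enqueue(link.AbsoluteUri);
                }
            }
        }
    }
}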
When you load that page in a browser, it will only get (server-side browser switching aside) what you get with your request. What the browser then does, and what you need to do in your code, is parse this content - it contains references (e.g. via <script>, <img>, <link>, <iframe> and others) that give the URLs of the other resources to load.
It might be easier to use a prebuilt application such as wget, if it does what you need, or to use browser automation.
If you want to download a complete website including all of its contents, then you can use a tool called HTTrack. HTTrack allows users to download World Wide Web sites from the Internet to a local computer. Here is the link you can follow:
https://www.httrack.com/page/2/en/index.html
Hello, I am making a simple HttpWebRequest and then reading the response (with a StreamReader); I just want to get the HTML page of the website, but I get only one label (a single element of the page). In the browser everything is fine (I see the whole page), but when I set cookies to deny/disabled, the browser also shows only that one label and everything else disappears. So my conclusion is that if, after disabling cookies in the browser, I get the same page as I get in code, it means my HttpWebRequest effectively has cookies set to deny/disabled.
You can go to https://www.bbvanetcash.com/local_kyop/KYOPSolicitarCredenciales.html and disable cookies with F12 and you will see the difference - you also get this page with one label.
So this is my code; any ideas what I need to change here?
HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create("https://www.bbvanetcash.com/local_kyop/KYOPSolicitarCredenciales.html");
HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse();
Stream streamResponseLogin = myHttpWebResponse.GetResponseStream();
StreamReader streamReadLogin = new StreamReader(streamResponseLogin);
LoginInfo = streamReadLogin.ReadToEnd();
Your code is receiving the complete page content, but it cannot receive the dynamic content. This is happening because the page you are trying to access relies on cookies for maintaining the session, as well as JavaScript (it is using jQuery) for loading dynamic content and providing a rich user experience.
To successfully receive the whole page, your code must:
support retrieving, storing and sending cookie objects across the various HttpWebRequest and HttpWebResponse calls (a sketch of this part is shown below)
be able to execute JavaScript code to load the dynamic content/markup of the page
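The cookie part can be handled by sharing a single CookieContainer across requests; a minimal sketch reusing the URL from the question (the follow-up request is only illustrative, and this still does not solve the JavaScript part):
// share one CookieContainer so cookies set by the first response are sent on later requests
CookieContainer cookieJar = new CookieContainer();
HttpWebRequest firstRequest = (HttpWebRequest)WebRequest.Create("https://www.bbvanetcash.com/local_kyop/KYOPSolicitarCredenciales.html");
firstRequest.CookieContainer = cookieJar; // cookies from this response are stored in the container
string firstPage;
using (HttpWebResponse firstResponse = (HttpWebResponse)firstRequest.GetResponse())
using (StreamReader firstReader = new StreamReader(firstResponse.GetResponseStream()))
{
    firstPage = firstReader.ReadToEnd();
}
// any follow-up request that reuses the same container sends those cookies back automatically
HttpWebRequest nextRequest = (HttpWebRequest)WebRequest.Create("https://www.bbvanetcash.com/local_kyop/KYOPSolicitarCredenciales.html");
nextRequest.CookieContainer = cookieJar;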
To test whether your code is receiving the proper values or not, visit the Web Sniffer site and put your URL there.
As you can try on the web-sniffer site, for www.google.com the response you get is a redirect instruction... that means that even to access Google's home page, your code must understand HTTP status messages (a 302 in that case).
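If you want to see those redirects rather than follow them silently, you can turn off automatic redirection and inspect the status code yourself. A small sketch, using the Google URL as an example:
HttpWebRequest probe = (HttpWebRequest)WebRequest.Create("http://www.google.com/");
probe.AllowAutoRedirect = false; // don't follow redirects automatically
using (HttpWebResponse reply = (HttpWebResponse)probe.GetResponse())
{
    if ((int)reply.StatusCode >= 300 && (int)reply.StatusCode < 400)
    {
        // the Location header tells you where the server wants you to go next
        Console.WriteLine("Redirected ({0}) to {1}", (int)reply.StatusCode, reply.Headers["Location"]);
    }
}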
I am trying to post using HttpWebRequest and this is the response I keep getting back:
you must use a browser that supports and has JavaScript enabled
This is my post code:
HttpWebRequest myRequest = null;
myRequest = (HttpWebRequest)HttpWebRequest.Create(submitURL);
myRequest.Headers.Add("Accept-Language", "en-US");
myRequest.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/xaml+xml, application/vnd.ms-xpsdocument, application/x-ms-xbap, application/x-ms-application, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
myRequest.Method = WebRequestMethods.Http.Post;
myRequest.Headers.Add("Accept-Language", "en-US");
myRequest.Accept = "*/*, text/xml";
myRequest.ContentType = "application/x-www-form-urlencoded" + "\n" + "\r";
myRequest.CookieContainer = cookieContainer;
myRequest.Headers.Add("UA-CPU", "x86");
myRequest.Headers.Add("Accept-Encoding", "gzip, deflate");
//cPostData section removed as submitting to SO
myRequest.ContentLength = cPostData.Length;
myRequest.ServicePoint.Expect100Continue = false;
StreamWriter streamWriter = new System.IO.StreamWriter(myRequest.GetRequestStream());
streamWriter.Write(cPostData);
streamWriter.Close();
HttpWebResponse httpWebResponse = (HttpWebResponse)myRequest.GetResponse();
StreamReader streamReader = new System.IO.StreamReader(httpWebResponse.GetResponseStream());
string stringResult = streamReader.ReadToEnd();
streamReader.Close();
How do I avoid getting this error?
It is difficult to say what the exact problem is because the server that is receiving your request doesn't think it is valid.
Perhaps the first thing to try would be to set the UserAgent property on your HttpWebRequest to some valid browser's user agent string as the server may be using this value to determine whether or not to serve the page.
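For example (the user agent string below is just an illustrative value; any real browser's string will do):
myRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)";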
This doesn't have anything to do with your code - the web server code has something that detects or relies on Javascript. Most likely a piece of Javascript on the page fills out (or modifies prior to posting) some hidden form field(s).
The solution to this is entirely dependent on what the web server is expecting to happen with that form data.
This is a layman's answer, not a 100% technically accurate description of the HttpWebRequest object; it is written that way because of the amount of time a fuller description would take to post. The first part of this answer clarifies the final sentence.
The HttpWebRequest object basically acts as a browser interacting with web pages. It's a very simple browser with no UI, designed basically to post to and read from web pages. As such, it does not support a variety of features normally found in a browser these days, such as JavaScript.
The page you are attempting to post to requires JavaScript, which the HttpWebRequest object does not support. If you have no control over the page that the WebRequest object is posting to, then you'll have to find another way to post to it. If you own or control the page, you will need to modify it to strip out items that require JavaScript (such as Ajax features, etc.).
Added
I purposely didn't add anything about specifying a user agent to try to trick the web server into thinking the HttpWebRequest object supports JavaScript, because it is likely that the page really does need JavaScript enabled in order to display properly. However, a lot of my assumptions prove wrong, so I would agree with @Andrew Hare and say it's worth a try.
In my asp.net-mvc application I need to include a page that shows a legacy page.
The body of this page is created by calling an existing Perl script.
This Perl script is externally hosted.
Is there a way to do something like this:
<!-- #Include virtual="http://www.example.com/theScript.plx"-->
Not as a direct include, because ASP.NET server-side includes require the page to be compiled at the server.
You could use jQuery to download the HTML from that URL when the page loads, though I appreciate that's not perfect.
Alternatively (and I have no idea whether this will work) you could perform a WebRequest to the Perl webpage from your ASP.NET MVC controller, and put the resulting HTML in the view as text. That way you could make use of things like output caching to limit the hits to the Perl page if it doesn't change often.
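A rough sketch of that idea, assuming an action named LegacyPage and a view that renders ViewBag.LegacyHtml via Html.Raw (the names, URL and cache duration are placeholders):
[OutputCache(Duration = 300)] // cache the fetched markup for five minutes
public ActionResult LegacyPage()
{
    using (var client = new System.Net.WebClient())
    {
        // fetch the Perl-generated markup and hand it to the view as a string
        ViewBag.LegacyHtml = client.DownloadString("http://www.example.com/theScript.plx");
    }
    return View();
}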
If you wanted to do it all in one go, you could do an HTTP Request from the server and write the contents to the page?
Something like this:
Response.Write(GetHtmlPage("http://www.example.com/theScript.plx"));
Calling this method:
public String GetHtmlPage(string strURL)
{
// the html retrieved from the page
String strResult;
WebResponse objResponse;
WebRequest objRequest = System.Net.HttpWebRequest.Create(strURL);
objResponse = objRequest.GetResponse();
// the using keyword will automatically dispose the object
// once complete
using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
{
strResult = sr.ReadToEnd();
// Close and clean up the StreamReader
sr.Close();
}
return strResult;
}
(Most code ripped blatantly from here and therefore not checked)
You could implement this in a low-key fashion by simply using a frame and setting the frame source to the URL that needs to be included. This is quite simple and can be done without any server- or client-side scripting, so that'd be my preferred approach, if possible.
If you want the HTML to appear to come from your server, however, you'll need to include it manually - typically by using WebRequest as Neil says. You may wish to cache the remote page for performance, though since it's a Perl script I'll assume the page is dynamic, so this might not be a great idea.
I have a project at work that requires me to be able to enter information into a web page, read the next page I get redirected to, and then take further action. A simplified real-world example would be something like going to google.com, entering "Coding tricks" as search criteria, and reading the resulting page.
Small coding examples like the ones linked to at http://www.csharp-station.com/HowTo/HttpWebFetch.aspx tell how to read a web page, but not how to interact with it by submitting information into a form and continuing on to the next page.
For the record, I'm not building a malicious and/or spam related product.
So how do I go read web pages that require a few steps of normal browsing to reach first?
You can programmatically create an Http request and retrieve the response:
string uri = "http://www.google.com/search";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";
// encode the data to POST:
string postData = "q=searchterm&hl=en";
byte[] encodedData = new ASCIIEncoding().GetBytes(postData);
request.ContentLength = encodedData.Length;
Stream requestStream = request.GetRequestStream();
requestStream.Write(encodedData, 0, encodedData.Length);
// send the request and get the response
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
// Do something with the response stream. As an example, we'll
// stream the response to the console via a 256 character buffer
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
Char[] buffer = new Char[256];
int count = reader.Read(buffer, 0, 256);
while (count > 0)
{
Console.WriteLine(new String(buffer, 0, count));
count = reader.Read(buffer, 0, 256);
}
} // reader is disposed here
} // response is disposed here
Of course, this code will return an error since Google uses GET, not POST, for search queries.
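For what it's worth, the GET equivalent would look roughly like this (a sketch only; Google may still redirect or block automated queries):
// same search as above, but as a GET with the parameters in the query string
string getUri = "http://www.google.com/search?q=searchterm&hl=en";
HttpWebRequest getRequest = (HttpWebRequest)WebRequest.Create(getUri); // Method defaults to GET
using (HttpWebResponse getResponse = (HttpWebResponse)getRequest.GetResponse())
using (StreamReader getReader = new StreamReader(getResponse.GetResponseStream()))
{
    Console.WriteLine(getReader.ReadToEnd());
}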
This method will work if you are dealing with specific web pages, as the URLs and POST data are all basically hard-coded. If you needed something that was a little more dynamic, you'd have to (see the sketch after this list):
Capture the page
Strip out the form
Create a POST string based on the form fields
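A hedged sketch of the last two steps (BuildPostData is a made-up helper; the regex only handles simple name="..." value="..." inputs, so a real HTML parser such as the HTML Agility Pack would be more robust):
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class FormScraper
{
    // Pulls <input name="..." value="..."> pairs out of captured HTML and
    // builds an application/x-www-form-urlencoded POST body from them.
    public static string BuildPostData(string formHtml)
    {
        var pairs = new List<string>();
        var inputPattern = new Regex(
            "<input[^>]*name=\"(?<name>[^\"]+)\"[^>]*value=\"(?<value>[^\"]*)\"",
            RegexOptions.IgnoreCase);
        foreach (Match m in inputPattern.Matches(formHtml))
        {
            pairs.Add(Uri.EscapeDataString(m.Groups["name"].Value) + "=" +
                      Uri.EscapeDataString(m.Groups["value"].Value));
        }
        return string.Join("&", pairs.ToArray());
    }
}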
FWIW, I think something like Perl or Python might be better suited to that sort of task.
edit: x-www-form-urlencoded
You might try Selenium. Record the actions in Firefox using Selenium IDE, save the script in C# format, then play them back using the Selenium RC C# wrapper. As others have mentioned you could also use System.Net.HttpWebRequest or System.Net.WebClient. If this is a desktop application see also System.Windows.Forms.WebBrowser.
Addendum: Similar to Selenium IDE and Selenium RC, which are Java-based, WatiN Test Recorder and WatiN are .NET-based.
What you need to do is keep retrieving and analyzing the html source for each page in the chain. For each page, you need to figure out what the form submission will look like and send a request that will match that to get the next page in the chain.
What I do is build a custom class that wraps System.Net.HttpWebRequest/HttpWebResponse, so retrieving pages is as simple as using System.Net.WebClient. However, my custom class also keeps the same cookie container across requests and makes it a little easier to send post data, customize the user agent, etc.
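A stripped-down sketch of such a wrapper (the class and method names are made up for illustration; error handling is omitted):
using System.IO;
using System.Net;
using System.Text;

public class StickyWebClient
{
    // one CookieContainer and one user agent shared by every request this class makes
    private readonly CookieContainer _cookies = new CookieContainer();
    public string UserAgent = "Mozilla/5.0 (compatible; StickyWebClient)";

    public string Get(string url)
    {
        return Read(Prepare(url));
    }

    public string Post(string url, string formEncodedData)
    {
        HttpWebRequest request = Prepare(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        byte[] body = Encoding.ASCII.GetBytes(formEncodedData);
        request.ContentLength = body.Length;
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        return Read(request);
    }

    private HttpWebRequest Prepare(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = _cookies; // same container, so cookies persist across calls
        request.UserAgent = UserAgent;
        return request;
    }

    private static string Read(HttpWebRequest request)
    {
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}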
Depending on how the website works, you may be able to manipulate the URL to perform what you want, e.g. to search for the word "beatles" you could just open a request to google.com?q=beatles and then read the results.
Alternatively, if the website does not use querystring values (in the URL) to process page actions, then you will need to work with a WebRequest that posts the required values to the website instead. Search Google for working with WebRequest and WebResponse.