C# - WebRequest Doesn't Return Different Pages

Here's the purpose of my console program: Make a web request > Save results from web request > Use QueryString to get next page from web request > Save those results > Use QueryString to get next page from web request, etc.
So here's some pseudocode for how I set the code up.
string strPageNo;
string strURL;
WebRequest wrGETURL;
Stream objStream;
StreamReader objReader;

for (int i = 0; i < 3; i++)
{
    strPageNo = Convert.ToString(i);

    // creates the URL I want, with incrementing pages
    strURL = "http://www.website.com/results.aspx?page=" + strPageNo;

    // makes the web request
    wrGETURL = WebRequest.Create(strURL);

    // gets the web page for me
    objStream = wrGETURL.GetResponse().GetResponseStream();

    // for reading the web page
    objReader = new StreamReader(objStream);

    //--------
    // -snip- code that saves it to file, etc.
    //--------

    objReader.Close();
    objStream.Close();

    // so the server doesn't get hammered
    System.Threading.Thread.Sleep(1000);
}
Pretty simple, right? The problem is, even though it increments the page number to get a different web page, I'm getting the exact same results page each time the loop runs.
i IS incrementing correctly, and I can cut/paste the URL strURL creates into a web browser, and it works just fine.
I can manually type in &page=1, &page=2, &page=3, and it'll return the correct pages. Somehow putting the increment in there screws it up.
Does it have anything to do with sessions, or what? I make sure I close both the stream and the reader before it loops again...

Have you tried creating a new WebRequest object each time through the loop? It could be that the Create() method isn't adequately flushing out all of its old data.
Another thing to check is that the response stream is fully read and closed before the next loop iteration.
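For illustration, here's a sketch of that loop with a fresh request per iteration and the response, stream, and reader all wrapped in using blocks (the results.aspx URL is the asker's placeholder, and the page#.html output name is mine):

using System;
using System.IO;
using System.Net;

for (int i = 0; i < 3; i++)
{
    string url = "http://www.website.com/results.aspx?page=" + i;

    // a brand-new request object on every iteration
    WebRequest request = WebRequest.Create(url);

    // using blocks guarantee the response, stream, and reader are
    // disposed even if an exception is thrown mid-loop
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        string html = reader.ReadToEnd();
        File.WriteAllText("page" + i + ".html", html);
    }

    // so the server doesn't get hammered
    System.Threading.Thread.Sleep(1000);
}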

This code works fine for me:
var urls = new[] { "http://www.google.com", "http://www.yahoo.com", "http://www.live.com" };
foreach (var url in urls)
{
    WebRequest request = WebRequest.Create(url);
    using (Stream responseStream = request.GetResponse().GetResponseStream())
    using (Stream outputStream = new FileStream("file" + DateTime.Now.Ticks.ToString(), FileMode.Create, FileAccess.Write, FileShare.None))
    {
        const int chunkSize = 1024;
        byte[] buffer = new byte[chunkSize];
        int bytesRead;
        while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
        {
            byte[] actual = new byte[bytesRead];
            Buffer.BlockCopy(buffer, 0, actual, 0, bytesRead);
            outputStream.Write(actual, 0, actual.Length);
        }
    }
    Thread.Sleep(1000);
}

Just a suggestion: try disposing the Stream and the Reader. I've seen some weird cases where failing to dispose objects like these inside loops can yield some wacky results.

That URL doesn't quite make sense to me unless you are using MVC or something that can interpret the querystring correctly.
http://www.website.com/results.aspx&page=
should be:
http://www.website.com/results.aspx?page=
Some browsers will accept poorly formed URLs and render them fine. Others may not, which may be the problem with your console app.
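If you want to rule out separator mistakes entirely, one option (my suggestion, not from the original answers; System namespace assumed) is to let UriBuilder assemble the query string:

// UriBuilder inserts the '?' between path and query itself, so the page
// parameter can never end up glued on with '&' by mistake
var builder = new UriBuilder("http://www.website.com/results.aspx");
builder.Query = "page=" + strPageNo;
strURL = builder.Uri.ToString(); // http://www.website.com/results.aspx?page=0, ?page=1, ...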

Here's my terrible, hack-ish workaround solution:
Make another console app that calls THIS one, passing an argument that gets appended to the end of strURL. It works, but I feel so dirty.

Related

Content from a website in a text file

My aim is to get content from a website (for instance a league table from a sports website) and put it in a .txt file so that I can code with a local file.
I have tried multiple lines of code and other examples, such as:
// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)
    WebRequest.Create("http://www.stackoverflow.com");

// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)
    WebRequest.Create("http://www.stackoverflow.com");

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

byte[] buf = new byte[8192];
StringBuilder sb = new StringBuilder();
string tempString = null;
int count = 0;

do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);

    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);

        // continue building the string
        sb.Append(tempString);
    }
} while (count > 0); // any more data to read?
My issue when trying this is that the words request and response are underlined in red and all the tokens are invalid.
Is there a better method to get content from a website to a .txt file or is there a way to fix the code supplied?
Thanks
is there a way to fix the code supplied?
The code you submitted works for me; make sure you have the proper namespaces defined.
In this case : using System.Net;
Or might it be that the duplicate declaration of the variable request isn't a typo?
If so, remove one of the request declarations.
Is there a better method to get content from a website to a .txt file
Since you're reading all the content from the site anyway, there isn't really a need for the while loop. Instead, you can use the ReadToEnd method supplied by StreamReader.
string siteContent = "";
using (StreamReader reader = new StreamReader(resStream))
{
    siteContent = reader.ReadToEnd();
}
Also be sure to dispose of the WebResponse; other than that, your code should work fine.
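Putting it together, a minimal sketch of the whole fetch-to-.txt flow (the output file name is a placeholder of mine):

using System.IO;
using System.Net;

// prepare and execute the request
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.stackoverflow.com");

// dispose the response and reader when done
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    string siteContent = reader.ReadToEnd();

    // save the page so you can code against a local file
    File.WriteAllText("league-table.txt", siteContent);
}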

How to cancel large file download yet still get page source in C#?

I'm working in C# on a program to list all course resources for a MOOC (e.g. Coursera). I don't want to download the content, just get a listing of all the resources (e.g. pdf, videos, text files, sample files, etc...) which are made available to the course.
My problem lies in parsing the html source (currently using HtmlAgilityPack) without downloading all the content.
For example, if you go to this intro video for a banking course on Coursera and check the source (F12 in Chrome for Developer Tools), you can see the page source. I can stop the video, which autoplays, from downloading, but still see the source.
How can I get the source in C# without downloading all the content?
I've looked in the HttpWebRequest headers (problem: time out), and DownloadDataAsync with Cancel (problem: the Completed Result object is invalid when cancelling the async request). I've also tried various Loads from HtmlAgilityPack but with no success.
Time out:
HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create(url);
postRequest.Timeout = TIMEOUT * 1000000; //Really long
postRequest.Referer = "https://www.coursera.org";
if (headers != null)
{ //headers here }
//Deal with cookies
if (cookie != null)
{ cookieJar.Add(cookie); }
postRequest.CookieContainer = cookieJar;
postRequest.Method = "GET";
postRequest.AllowAutoRedirect = allowRedirect;
postRequest.ServicePoint.Expect100Continue = true;
HttpWebResponse postResponse = (HttpWebResponse)postRequest.GetResponse();
Any tips on how to proceed?
There are at least two ways to do what you're asking. The first is to use a range GET. That is, specify the range of the file you want to read. You do that by calling AddRange on the HttpWebRequest. So if you want, say, the first 10 kilobytes of the file, you'd write:
request.AddRange(0, 10239);
Read carefully what the documentation says about the meaning of those parameters. There are also other overloads of AddRange that you might be interested in; for example, the single-parameter overload with a negative value requests that many bytes from the end of the data.
Not all servers support range gets, though. If that doesn't work, you'll have to do it another way.
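One rough way to detect support (my sketch, not part of the original answer): a server that honors the range request answers 206 PartialContent, while one that ignores it answers 200 OK and sends the full body.

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.AddRange(0, 10239); // ask for the first 10 KB only

using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
{
    if (response.StatusCode == HttpStatusCode.PartialContent)
    {
        // the server honored the range; the body is at most 10 KB
    }
    else
    {
        // the server ignored the range; fall back to reading and
        // abandoning the response early, as shown below
    }
}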
What you can do is call GetResponse and then start reading data. Once you've read as much data as you want, you can stop reading and close the stream. I've modified your sample slightly to show what I mean.
string url = "https://www.coursera.org/course/money";
HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create(url);
postRequest.Method = "GET";
postRequest.AllowAutoRedirect = true; //allowRedirect;
postRequest.ServicePoint.Expect100Continue = true;
HttpWebResponse postResponse = (HttpWebResponse)postRequest.GetResponse();

int maxBytes = 1024 * 1024;
int totalBytesRead = 0;
var buffer = new byte[maxBytes];

using (var s = postResponse.GetResponseStream())
{
    int bytesRead;

    // read up to `maxBytes` bytes from the response
    while (totalBytesRead < maxBytes && (bytesRead = s.Read(buffer, 0, maxBytes)) != 0)
    {
        // Here you can save the bytes read to a persistent buffer,
        // or write them to a file.
        Console.WriteLine("{0:N0} bytes read", bytesRead);
        totalBytesRead += bytesRead;
    }
}
Console.WriteLine("total bytes read = {0:N0}", totalBytesRead);
That said, I ran this sample and it downloaded about 6 kilobytes and stopped. I don't know why you're having trouble with timeouts or too much data.
Note that sometimes trying to close the stream before the entire response is read will cause the program to hang. I'm not sure why that happens at all, and I can't explain why it only happens sometimes. But you can solve it by calling postRequest.Abort() before closing the stream. That is:
using (var s = postResponse.GetResponseStream())
{
    // do stuff here

    // abort the request before continuing
    postRequest.Abort();
}
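A plausible explanation for the hang (my note, not the answerer's): when a response stream is closed normally, .NET tries to drain the unread remainder of the body so the underlying connection can be reused for a later request, and on a very large response that drain can take long enough to look like a hang. Abort tears the connection down instead of draining it.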

HttpWebRequest Slows with multiple instances of application

Trying to get to the bottom of this!
I have a very basic app that uses HttpWebRequests to log in, navigate to a page, and then grab the HTML of that page. It then performs another web request to a third page every 5 minutes in a loop.
It's all working fine and is single-threaded (and fairly old); however, circumstances have changed and I now need to run multiple instances of this app closely together (I have a .bat starting the app every 2 seconds as a temporary measure until I am able to code a new multithreaded solution).
When the first instances of the app start, everything is fine: the first request completes in ~2 seconds, the second one in about 3 seconds.
However, as more and more instances of this app run concurrently (>100), something strange starts to happen.
The first web request still takes ~2 seconds, but the second request gets delayed by much more, >1 min, up to the point of timeout. I can't think why this is. The second page is larger than the first, but nothing out of the ordinary that would take >1 min to download.
The internet connection and hardware of this server are more than capable of handling these requests.
CookieContainer myContainer = new CookieContainer();
byte[] buf = new byte[8192];
StringBuilder sb = new StringBuilder();

// first request is https
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://mysite.com/urlone");
request.CookieContainer = myContainer;
request.Proxy = proxy; // proxy is defined elsewhere in the app
Console.WriteLine(System.DateTime.Now.ToLongTimeString() + " " + "Starting login request");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream resStream = response.GetResponseStream();
string tempString = null;
int count = 0;
do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);
    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);
        // continue building the string
        sb.Append(tempString);
    }
}
while (count > 0); // any more data to read?
sb.Clear();
response.Close();
resStream.Close();

string output6;
Console.WriteLine(System.DateTime.Now.ToLongTimeString() + " " + "login request complete");
HttpWebRequest request6 = (HttpWebRequest)WebRequest.Create(@"http://mysite.com/page2");
request6.CookieContainer = myContainer;
response = (HttpWebResponse)request6.GetResponse();
resStream = response.GetResponseStream();
tempString = null;
count = 0;
do
{
    count = resStream.Read(buf, 0, buf.Length);
    if (count != 0)
    {
        tempString = Encoding.ASCII.GetString(buf, 0, count);
        sb.Append(tempString);
    }
}
while (count > 0);
output6 = sb.ToString();
sb.Clear();
response.Close();
resStream.Close();
Any ideas? I'm not very advanced with HTTP web requests, so if someone could check that I haven't made any silly code mistakes above, I'd appreciate it. I'm at a loss as to what other information I may need to include here; if I have missed anything out, please tell me and I will do my best to provide it.
Thanks in advance.
EDIT 1:
I used Fiddler to find out the source of the issue. It looks like the issue lies with the application (or Windows) not sending the requests for some reason; the physical request actually takes < 1 second according to Fiddler.
Check out a few things:
ServicePointManager.DefaultConnectionLimit: if you are planning to open more than 100 connections, set this value to something like 200-300.
If possible, use HttpWebRequest.KeepAlive = true.
Try wrapping your request in a using block to make sure it's always properly closed. If you hit the max number of connections, you otherwise have to wait for the earlier ones to time out before new ones connect.
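A sketch combining those suggestions (reusing myContainer from the question's code):

// raise the per-host connection cap before issuing any requests
ServicePointManager.DefaultConnectionLimit = 300;

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://mysite.com/urlone");
request.CookieContainer = myContainer;
request.KeepAlive = true; // let the follow-up request reuse the connection

// disposing the response promptly returns the connection to the pool
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();
}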

Image URL has the contentType "text/html"

I want to implement a method to download an image from a website to my laptop.
public static void DownloadRemoteImageFile(string uri, string fileName)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();

    if ((response.StatusCode == HttpStatusCode.OK ||
         response.StatusCode == HttpStatusCode.Moved ||
         response.StatusCode == HttpStatusCode.Redirect) &&
        response.ContentType.StartsWith("image", StringComparison.OrdinalIgnoreCase))
    {
        // if the remote file was found, download it
        using (Stream inputStream = response.GetResponseStream())
        using (Stream outputStream = File.OpenWrite(fileName))
        {
            byte[] buffer = new byte[4096];
            int bytesRead;
            do
            {
                bytesRead = inputStream.Read(buffer, 0, buffer.Length);
                outputStream.Write(buffer, 0, bytesRead);
            } while (bytesRead != 0);
        }
    }
}
But the ContentType of the response is never "image/jpg" or "image/png"; it's always "text/html". I think that's why, after I save them locally, they have incorrect content and I cannot view them.
Does anyone have a solution?
Thanks
Try setting the content type to a specific image type:
Response.ContentType = "image/jpeg";
You can use this code, based on the JpegBitmapDecoder class:
JpegBitmapDecoder decoder = new JpegBitmapDecoder(YourImageStreamSource, BitmapCreateOptions.PreservePixelFormat, BitmapCacheOption.Default);
//here you can adjust YourImageStreamSource with your outputStream value
BitmapSource bitmapSource = decoder.Frames[0];

// re-encode the frame and save it to disk
JpegBitmapEncoder encoder = new JpegBitmapEncoder();
encoder.Frames.Add(BitmapFrame.Create(bitmapSource));
using (FileStream fs = new FileStream("YourImage.jpg", FileMode.Create))
{
    encoder.Save(fs);
}
Link : http://msdn.microsoft.com/en-us/library/aa970689.aspx
It may be that the sites you wish to get the image(s) from require a cookie. Sometimes when we use our browsers to go to a site we don't notice it, but the browser actually visits the site briefly, picks up a cookie, and then quickly reloads the page, this time passing the cookie along, whereupon the site accepts it and returns the image.
To elaborate, this means your method would be doing only half of what your browser is actually doing: half of two GET requests. The first one gets the cookie, and the second one actually gets the image itself.
Information from (and maybe a bit related): C# generate a cookie dynamically that site will accept?
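A minimal sketch of that two-request dance (the URLs are hypothetical; the key point is sharing one CookieContainer across both requests):

var cookies = new CookieContainer();

// first GET: pick up whatever cookies the site sets
HttpWebRequest first = (HttpWebRequest)WebRequest.Create("http://www.example.com/photo-page");
first.CookieContainer = cookies;
using (first.GetResponse()) { } // we only care about the Set-Cookie headers

// second GET: same container, so the cookies ride along automatically
HttpWebRequest second = (HttpWebRequest)WebRequest.Create("http://www.example.com/images/photo.jpg");
second.CookieContainer = cookies;
using (HttpWebResponse response = (HttpWebResponse)second.GetResponse())
using (Stream input = response.GetResponseStream())
using (Stream output = File.OpenWrite("photo.jpg"))
{
    byte[] buffer = new byte[4096];
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        output.Write(buffer, 0, read);
    }
}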
Your code is OK, but what you are trying to do is often considered undesired behavior by web site owners. Most sites want you to see images on the site but not download them at random. You can search for the opposite of your question to learn what techniques and protections you are up against.
I strongly recommend reading the usage agreement or any similar document on the site you are trying to scrape before continuing.

Slow performance in reading from stream .NET

I have a monitoring system and I want to save a snapshot from a camera when an alarm triggers.
I have tried many methods to do that, and they all work fine: stream a snapshot from the camera, then save it as a .jpg on the PC (JPG format, 1280x1024, ~140 KB). That's fine.
But my problem is the application's performance: the app needs about 20-30 seconds to read the stream, which is not acceptable because the method will be called every 2 seconds. I need to know what's wrong with that code and how I can make it much faster.
Many thanks in advance
Code:
string sourceURL = "http://192.168.0.211/cgi-bin/cmd/encoder?SNAPSHOT";
byte[] buffer = new byte[200000];
int read, total = 0;
WebRequest req = WebRequest.Create(sourceURL);
req.Credentials = new NetworkCredential("admin", "123456");
WebResponse resp = req.GetResponse();
Stream stream = resp.GetResponseStream();
while ((read = stream.Read(buffer, total, 1000)) != 0)
{
    total += read;
}
Bitmap bmp = (Bitmap)Bitmap.FromStream(new MemoryStream(buffer, 0, total));
string path = JPGName.Text + ".jpg";
bmp.Save(path);
I very much doubt that this code is the cause of the problem, at least for the first method call (but read further below).
Technically, you could produce the Bitmap without saving to a memory buffer first, or, if you don't need to display the image as well, you could save the raw data without ever constructing a Bitmap, but neither is going to help in terms of a multiple-second improvement. Have you checked how long it takes to download the image from that URL using a browser, wget, curl, or whatever other tool? I suspect something is going on with the encoding source.
Something you should do is clean up your resources: close the stream properly. This can potentially cause the problem if you call this method regularly, because .NET will only open a few connections to the same host at any one time.
// Make sure the stream gets closed once we're done with it
using (Stream stream = resp.GetResponseStream())
{
    // A larger buffer size would be beneficial, but it's not going
    // to make a significant difference.
    while ((read = stream.Read(buffer, total, 1000)) != 0)
    {
        total += read;
    }
}
I can't test the network behavior of the WebResponse stream, but note that you handle the data twice (once in your loop and once with your memory stream).
I don't think that's the whole problem, but I'd give this a try:
string sourceURL = "http://192.168.0.211/cgi-bin/cmd/encoder?SNAPSHOT";
WebRequest req = WebRequest.Create(sourceURL);
req.Credentials = new NetworkCredential("admin", "123456");
WebResponse resp = req.GetResponse();
Stream stream = resp.GetResponseStream();
Bitmap bmp = (Bitmap)Bitmap.FromStream(stream);
string path = JPGName.Text + ".jpg";
bmp.Save(path);
Try reading bigger pieces of data than 1,000 bytes at a time. I can see no problem with, for example:
read = stream.Read(buffer, 0, buffer.Length);
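Fleshed out (my sketch, keeping the asker's camera URL and credentials), reading 8 KB chunks into a MemoryStream also removes the fixed 200,000-byte cap:

WebRequest req = WebRequest.Create(sourceURL);
req.Credentials = new NetworkCredential("admin", "123456");

using (WebResponse resp = req.GetResponse())
using (Stream stream = resp.GetResponseStream())
using (MemoryStream ms = new MemoryStream())
{
    byte[] buffer = new byte[8192]; // 8 KB chunks instead of 1000 bytes
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        ms.Write(buffer, 0, read);
    }

    ms.Position = 0;
    using (Bitmap bmp = (Bitmap)Bitmap.FromStream(ms))
    {
        bmp.Save(JPGName.Text + ".jpg");
    }
}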
Try this to download the file.
using (WebClient webClient = new WebClient())
{
    webClient.DownloadFile("http://192.168.0.211/cgi-bin/cmd/encoder?SNAPSHOT", @"c:\Temp\myPic.jpg");
}
You can use a DateTime to put a unique stamp on the shot.
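For example (a sketch; the folder is the answer's placeholder and the format string is mine):

string path = @"c:\Temp\snapshot_" + DateTime.Now.ToString("yyyyMMdd_HHmmss") + ".jpg";
using (WebClient webClient = new WebClient())
{
    webClient.Credentials = new NetworkCredential("admin", "123456"); // same camera login as the question
    webClient.DownloadFile("http://192.168.0.211/cgi-bin/cmd/encoder?SNAPSHOT", path);
}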
