Content from a website in a text file - C#

My aim is to get content from a website (for instance a league table from a sports website) and put it in a .txt file so that I can code with a local file.
I have tried multiple lines of code and other examples, such as:
// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)
    WebRequest.Create("http://www.stackoverflow.com");
// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)
    WebRequest.Create("http://www.stackoverflow.com");
// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// we will read data via the response stream
Stream resStream = response.GetResponseStream();
// buffer for the raw bytes and a builder for the decoded text
byte[] buf = new byte[8192];
StringBuilder sb = new StringBuilder();
string tempString = null;
int count = 0;
do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);
    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);
        // continue building the string
        sb.Append(tempString);
    }
} while (count > 0); // any more data to read?
My issue when trying this is that the words request and response are underlined in red and all the tokens are invalid.
Is there a better method to get content from a website to a .txt file or is there a way to fix the code supplied?
Thanks

is there a way to fix the code supplied?
The code you submitted works for me; make sure you have the proper namespaces defined.
In this case: using System.Net;
Or might it be that the duplicate creation of the request variable isn't a typo?
If so, remove one of the request variables.
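For completeness, the snippet in the question also uses types from System.IO and System.Text, so the full set of directives it needs would be:
using System.IO;   // Stream
using System.Net;  // WebRequest, HttpWebRequest, HttpWebResponse
using System.Text; // Encoding, StringBuilder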
Is there a better method to get content from a website to a .txt file
Since you're reading all the content from the site anyway, there isn't really a need for the while loop. Instead, you can use the ReadToEnd method supplied by the StreamReader.
string siteContent = "";
using (StreamReader reader = new StreamReader(resStream))
{
    siteContent = reader.ReadToEnd();
}
Also be sure to dispose of the WebResponse; other than that, your code should work fine.
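Putting that together, a minimal sketch of the whole round trip might look like this (the URL and output file name are just placeholders):
using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        // Placeholder URL; substitute the page you actually want.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.stackoverflow.com");

        // using blocks dispose of the response and reader when done.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            string siteContent = reader.ReadToEnd();
            // Save to a local .txt file so you can code against it offline.
            File.WriteAllText("site.txt", siteContent);
        }
    }
}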

Related

Reading content length of website

I am trying to get the content length of a web page, for example http://www.google.com.
I am using C# and below is the code I used; it does not give me the correct length, or does it? Can someone validate, please?
var request = (HttpWebRequest)WebRequest.Create("http://www.google.com.au");
request.Method = "GET";
var myResponse = request.GetResponse();
var responseLength = myResponse.ContentLength;
using (var sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8))
{
    var result = sr.ReadToEnd();
    myResponse.Close();
}
responseLength is -1 always but result.Length has some value, is that correct?
responseLength is -1 always but result.Length has some value, is that correct?
Well it may be for some web sites (or some responses in some web sites) - in other cases, you'll see a non-negative value for responseLength. All you're doing is fetching the optional Content-Length HTTP header, basically... it's up to the server whether it supplies that or not.
Note that the response length, if provided, will be in bytes - whereas result.Length is in UTF-16 code units. If you want the content length in bytes, you should be reading the binary data from the stream directly rather than creating a StreamReader and reading it as text.
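A rough sketch of counting bytes that way, reusing myResponse from the question and reading the raw stream instead of wrapping it in a StreamReader:
// Count the body size in bytes by draining the raw response stream,
// rather than relying on the optional Content-Length header.
long totalBytes = 0;
byte[] buffer = new byte[8192];
using (var stream = myResponse.GetResponseStream())
{
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        totalBytes += read;
    }
}
Console.WriteLine("Body length in bytes: {0}", totalBytes);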
I think you want to DownloadString and then check the length.
Console.WriteLine(new WebClient().DownloadString("https://google.com/").Length);

How to cancel large file download yet still get page source in C#?

I'm working in C# on a program to list all course resources for a MOOC (e.g. Coursera). I don't want to download the content, just get a listing of all the resources (e.g. pdf, videos, text files, sample files, etc...) which are made available to the course.
My problem lies in parsing the html source (currently using HtmlAgilityPack) without downloading all the content.
For example, if you go to this intro video for a banking course on Coursera and check the source (F12 in Chrome for Developer Tools), you can see the page source. I can stop the video download which autoplays, but still see the source.
How can I get the source in C# without downloading all the content?
I've looked at HttpWebRequest headers (problem: timeout) and at DownloadDataAsync with Cancel (problem: the Completed Result object is invalid when cancelling the async request). I've also tried various Loads from HtmlAgilityPack, but with no success.
Time out:
HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create(url);
postRequest.Timeout = TIMEOUT * 1000000; //Really long
postRequest.Referer = "https://www.coursera.org";
if (headers != null)
{
    //headers here
}
//Deal with cookies
CookieContainer cookieJar = new CookieContainer();
if (cookie != null)
{
    cookieJar.Add(cookie);
}
postRequest.CookieContainer = cookieJar;
postRequest.Method = "GET";
postRequest.AllowAutoRedirect = allowRedirect;
postRequest.ServicePoint.Expect100Continue = true;
HttpWebResponse postResponse = (HttpWebResponse)postRequest.GetResponse();
Any tips on how to proceed?
There are at least two ways to do what you're asking. The first is to use a range get. That is, specify the range of the file you want to read. You do that by calling AddRange on the HttpWebRequest. So if you want, say, the first 10 kilobytes of the file, you'd write:
request.AddRange(-10240);
Read carefully what the documentation says about the meaning of that parameter. If it's negative, it specifies the ending point of the range. There are also other overloads of AddRange that you might be interested in.
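If you want an explicit range rather than relying on the sign convention (whose meaning is easy to misread in the docs), there is a two-argument overload; a sketch requesting the first 10 kilobytes, bytes 0 through 10239:
// Request bytes 0..10239 explicitly; this sends "Range: bytes=0-10239".
request.AddRange(0, 10239);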
Not all servers support range gets, though. If that doesn't work, you'll have to do it another way.
What you can do is call GetResponse and then start reading data. Once you've read as much data as you want, you can stop reading and close the stream. I've modified your sample slightly to show what I mean.
string url = "https://www.coursera.org/course/money";
HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create(url);
postRequest.Method = "GET";
postRequest.AllowAutoRedirect = true; //allowRedirect;
postRequest.ServicePoint.Expect100Continue = true;
HttpWebResponse postResponse = (HttpWebResponse)postRequest.GetResponse();
int maxBytes = 1024 * 1024;
int totalBytesRead = 0;
var buffer = new byte[maxBytes];
using (var s = postResponse.GetResponseStream())
{
    int bytesRead;
    // read up to `maxBytes` bytes from the response, accumulating into
    // the buffer so earlier chunks aren't overwritten and the cap holds
    while (totalBytesRead < maxBytes &&
           (bytesRead = s.Read(buffer, totalBytesRead, maxBytes - totalBytesRead)) != 0)
    {
        // Here you can save the bytes read to a persistent buffer,
        // or write them to a file.
        Console.WriteLine("{0:N0} bytes read", bytesRead);
        totalBytesRead += bytesRead;
    }
}
Console.WriteLine("total bytes read = {0:N0}", totalBytesRead);
That said, I ran this sample and it downloaded about 6 kilobytes and stopped. I don't know why you're having trouble with timeouts or too much data.
Note that sometimes trying to close the stream before the entire response is read will cause the program to hang. I'm not sure why that happens at all, and I can't explain why it only happens sometimes. But you can solve it by calling request.Abort before closing the stream. That is:
using (var s = postResponse.GetResponseStream())
{
    // do stuff here

    // abort the request before continuing
    postRequest.Abort();
}

Reversing a tsv/csv file or reading only last line using asp.net

I am pretty much stuck on a problem from the last few days. I have a file which is located on a remote server and can be accessed using a userId and password. Well, no problem in accessing it.
The problem is that I have around 150 of them, and each is of variable size: minimum 2 MB, maximum 3 MB.
I have to read them one by one and read the last row/line of data from each. I am doing that in my current code.
The main problem is that it takes too much time, since it reads each file from top to bottom.
public bool TEst(string ControlId, string FileName, long offset)
{
    // The serverUri parameter should use the ftp:// scheme.
    // It identifies the server file that is to be downloaded
    // Example: ftp://contoso.com/someFile.txt.
    // The fileName parameter identifies the local file.
    // The serverUri parameter identifies the remote file.
    // The offset parameter specifies where in the server file to start reading data.
    Uri serverUri;
    String ftpserver = "ftp://xxx.xxx.xx.xxx/" + FileName;
    serverUri = new Uri(ftpserver);
    if (serverUri.Scheme != Uri.UriSchemeFtp)
    {
        return false;
    }
    // Get the object used to communicate with the server.
    FtpWebRequest request = (FtpWebRequest)WebRequest.Create(serverUri);
    request.Credentials = new NetworkCredential("test", "test");
    request.Method = WebRequestMethods.Ftp.DownloadFile;
    request.ContentOffset = offset;
    FtpWebResponse response = null;
    try
    {
        response = (FtpWebResponse)request.GetResponse();
        // long Size = response.ContentLength;
    }
    catch (WebException e)
    {
        Console.WriteLine(e.Status);
        Console.WriteLine(e.Message);
        return false;
    }
    // Get the data stream from the response.
    Stream newFile = response.GetResponseStream();
    // Use a StreamReader to simplify reading the response data.
    StreamReader reader = new StreamReader(newFile);
    string newFileData = reader.ReadToEnd();
    // Split the response data on tabs and pick fields relative to the end.
    string[] parser = newFileData.Split('\t');
    string strID = parser[parser.Length - 5];
    string strName = parser[parser.Length - 3];
    string strStatus = parser[parser.Length - 1];
    if (strStatus.Trim().ToLower() != "suspect")
    {
        HtmlTableCell control = (HtmlTableCell)this.FindControl(ControlId);
        control.InnerHtml = strName.Split('.')[0];
    }
    else
    {
        HtmlTableCell control = (HtmlTableCell)this.FindControl(ControlId);
        control.InnerHtml = "S";
    }
    // Cleanup.
    reader.Close();
    response.Close();
    // Display the status description.
    //Console.WriteLine("Download restart - status: {0}", response.StatusDescription);
    return true;
}
Threading:
protected void Page_Load(object sender, EventArgs e)
{
    new Task(() => this.TEst("controlid1", "file1.tsv", 261454)).Start();
    new Task(() => this.TEst1("controlid2", "file2.tsv", 261454)).Start();
}
FTP is not capable of seeking within a file to read only the last few lines. Reference: FTP Commands. You'll have to coordinate with the developers and owners of the remote FTP server and ask them to make an additional file containing the data you need.
Example: ask the owners of the remote FTP server to create, for each file, a [filename]_lastrow file that contains the last row of that file. Your program would then operate on the [filename]_lastrow files. You'll probably be pleasantly surprised with an accommodating answer of "Ok, we can do that for you".
If the ftp server can't be changed ask for a database connection.
You can also download all your files in parallel and start popping them into a queue for parsing as they finish, rather than doing the whole process synchronously. If the FTP server can handle more connections, use as many as is reasonable for the scenario. Parsing can be done in parallel too; see the sketch below.
More reading: System.Threading.Tasks
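A sketch of that idea, under the question's assumptions (the FTP host, credentials, and file names come from the post; ParseLastLine is a hypothetical stand-in for whatever parsing or queueing you do):
// Requires: using System.Linq; using System.Net;
// using System.IO; using System.Threading.Tasks;
string[] files = { "file1.tsv", "file2.tsv" /* ... ~150 names ... */ };

Task[] tasks = files.Select(name => Task.Factory.StartNew(() =>
{
    var request = (FtpWebRequest)WebRequest.Create("ftp://xxx.xxx.xx.xxx/" + name);
    request.Credentials = new NetworkCredential("test", "test");
    request.Method = WebRequestMethods.Ftp.DownloadFile;
    using (var response = (FtpWebResponse)request.GetResponse())
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        // Parse (or enqueue) each file as soon as its download completes.
        ParseLastLine(name, reader.ReadToEnd()); // hypothetical helper
    }
})).ToArray();

Task.WaitAll(tasks);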
It's kinda buried, but I placed a comment in your original answer. This SO question leads to this blog post which has some awesome code you can draw from.
Rather than your while loop, you can skip directly to the end of the Stream by using Seek. You then want to work your way backwards through the stream until you find the first newline character. This post should give you everything you need to know:
Get last 10 lines of very large text file > 10GB
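Note that Seek only works on a seekable stream, so this applies to a local copy of the file, not to the FTP response stream itself. A rough sketch of pulling just the last line that way (the 4 KB tail size is a guess at how far back the last line can start):
// Requires: using System; using System.IO;
static string LastLine(string path, int tailBytes = 4096)
{
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
    {
        // Jump near the end of the file instead of reading from the top.
        long start = Math.Max(0, fs.Length - tailBytes);
        fs.Seek(start, SeekOrigin.Begin);
        using (var reader = new StreamReader(fs))
        {
            string tail = reader.ReadToEnd();
            string[] lines = tail.Split(new[] { '\r', '\n' },
                                        StringSplitOptions.RemoveEmptyEntries);
            return lines.Length > 0 ? lines[lines.Length - 1] : string.Empty;
        }
    }
}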
FtpWebRequest includes the ContentOffset property. Find/choose a way to keep the offset of the last line (locally or remotely, e.g. by uploading a 4-byte file to the FTP server). This is the fastest way to do it and the most efficient for network traffic.
More information about FtpWebRequest can be found on MSDN.
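A sketch of that approach, using the server-reported size instead of a stored offset (serverUri and the credentials are the ones from the question; the 4 KB tail is an assumption):
// First ask the server for the file size...
var sizeRequest = (FtpWebRequest)WebRequest.Create(serverUri);
sizeRequest.Credentials = new NetworkCredential("test", "test");
sizeRequest.Method = WebRequestMethods.Ftp.GetFileSize;
long fileSize;
using (var sizeResponse = (FtpWebResponse)sizeRequest.GetResponse())
{
    fileSize = sizeResponse.ContentLength;
}

// ...then download only the last few kilobytes of the file.
const int tailBytes = 4096; // assumed large enough to contain the last line
var request = (FtpWebRequest)WebRequest.Create(serverUri);
request.Credentials = new NetworkCredential("test", "test");
request.Method = WebRequestMethods.Ftp.DownloadFile;
request.ContentOffset = Math.Max(0, fileSize - tailBytes);
// ...then read the response as in the question's TEst method.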

My post request to https://qrng.physik.hu-berlin.de/ failed, why?

The page at https://qrng.physik.hu-berlin.de/ provides a high bit rate quantum random number generator web service, and I'm trying to access that service.
However, I could not manage to do so. This is my current code:
using System;
using System.Collections.Generic;
using System.Linq;
using S = System.Text;
using System.Security.Cryptography;
using System.IO;

namespace CS_Console_App
{
    class Program
    {
        static void Main()
        {
            System.Net.ServicePointManager.Expect100Continue = false;
            var username = "testuser";
            var password = "testpass";
            System.Diagnostics.Debug.WriteLine(Post("https://qrng.physik.hu-berlin.de/", "username=" + username + "&password=" + password));
            Get("http://qrng.physik.hu-berlin.de/download/sampledata-1MB.bin");
        }

        public static void Get(string url)
        {
            var my_request = System.Net.WebRequest.Create(url);
            my_request.Credentials = System.Net.CredentialCache.DefaultCredentials;
            var my_response = my_request.GetResponse();
            var my_response_stream = my_response.GetResponseStream();
            var stream_reader = new System.IO.StreamReader(my_response_stream);
            var content = stream_reader.ReadToEnd();
            System.Diagnostics.Debug.WriteLine(content);
            stream_reader.Close();
            my_response_stream.Close();
        }

        public static string Post(string url, string data)
        {
            string vystup = null;
            try
            {
                //Our postvars
                byte[] buffer = System.Text.Encoding.ASCII.GetBytes(data);
                //Initialisation, we use localhost, change if applicable
                System.Net.HttpWebRequest WebReq = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(url);
                //Our method is post, otherwise the buffer (postvars) would be useless
                WebReq.Method = "POST";
                //We use form contentType, for the postvars.
                WebReq.ContentType = "application/x-www-form-urlencoded";
                //The length of the buffer (postvars) is used as contentlength.
                WebReq.ContentLength = buffer.Length;
                //We open a stream for writing the postvars
                Stream PostData = WebReq.GetRequestStream();
                //Now we write, and afterwards, we close. Closing is always important!
                PostData.Write(buffer, 0, buffer.Length);
                PostData.Close();
                //Get the response handle, we have no true response yet!
                System.Net.HttpWebResponse WebResp = (System.Net.HttpWebResponse)WebReq.GetResponse();
                //Let's show some information about the response
                Console.WriteLine(WebResp.StatusCode);
                Console.WriteLine(WebResp.Server);
                //Now, we read the response (the string), and output it.
                Stream Answer = WebResp.GetResponseStream();
                StreamReader _Answer = new StreamReader(Answer);
                vystup = _Answer.ReadToEnd();
                //Congratulations, you just requested your first POST page, you
                //can now start logging into most login forms, with your application
                //Or other examples.
            }
            catch (Exception)
            {
                //Rethrow without resetting the stack trace.
                throw;
            }
            return vystup.Trim() + "\n";
        }
    }
}
I'm getting a 403 Forbidden error when I try to do a GET request on http://qrng.physik.hu-berlin.de/download/sampledata-1MB.bin.
After debugging a bit, I've realised that even though I've supplied a valid username and password, the response HTML sent back after my POST request indicates that I was not actually logged on to the system.
Does anyone know why this is the case, and how I may work around it to call the service?
Bump. Can anyone get this to work, or is the site just a scam?
The site is surely not a scam. I developed the generator and I put my scientific reputation in it. The problem is that you are trying to use the service in a way that was not intended. The sample files were really only meant to be downloaded manually for basic test purposes. Automated access to fetch data into an application was meant to be implemented through the DLLs we provide.
On the other hand, I do not know of any explicit intent to prevent your implementation to work. I suppose if a web browser can log in and fetch data, some program should be able to do the same. Maybe only the login request is just a little more complicated. No idea. The server software was developed by someone else and I cannot bother him with this right now.
Mick
Actually, the generator can now also be purchased. See here:
http://www.picoquant.com/products/pqrng150/pqrng150.htm
Have you tried changing this
my_request.Credentials = System.Net.CredentialCache.DefaultCredentials;
to
my_request.Credentials = new NetworkCredential(UserName, Password);
as described on the MSDN page?

C# - WebRequest Doesn't Return Different Pages

Here's the purpose of my console program: Make a web request > Save results from web request > Use QueryString to get next page from web request > Save those results > Use QueryString to get next page from web request, etc.
So here's some pseudocode for how I set the code up.
for (int i = 0; i < 3; i++)
{
    strPageNo = Convert.ToString(i);

    //creates the url I want, with incrementing pages
    strURL = "http://www.website.com/results.aspx?page=" + strPageNo;

    //makes the web request
    wrGETURL = WebRequest.Create(strURL);

    //gets the web page for me
    objStream = wrGETURL.GetResponse().GetResponseStream();

    //for reading web page
    objReader = new StreamReader(objStream);

    //--------
    // -snip- code that saves it to file, etc.
    //--------

    objStream.Close();
    objReader.Close();

    //so the server doesn't get hammered
    System.Threading.Thread.Sleep(1000);
}
Pretty simple, right? The problem is, even though it increments the page number to get a different web page, I'm getting the exact same results page each time the loop runs.
i IS incrementing correctly, and I can cut/paste the url strURL creates into a web browser and it works just fine.
I can manually type in &page=1, &page=2, &page=3, and it'll return the correct pages. Somehow putting the increment in there screws it up.
Does it have anything to do with sessions, or what? I make sure I close both the stream and the reader before it loops again...
Have you tried creating a new WebRequest object each time through the loop? It could be that the Create() method isn't adequately flushing out all of its old data.
Another thing to check is that the ResponseStream is adequately flushed out before the next loop iteration.
This code works fine for me:
var urls = new[] { "http://www.google.com", "http://www.yahoo.com", "http://www.live.com" };
foreach (var url in urls)
{
    WebRequest request = WebRequest.Create(url);
    using (Stream responseStream = request.GetResponse().GetResponseStream())
    using (Stream outputStream = new FileStream("file" + DateTime.Now.Ticks.ToString(), FileMode.Create, FileAccess.Write, FileShare.None))
    {
        const int chunkSize = 1024;
        byte[] buffer = new byte[chunkSize];
        int bytesRead;
        while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
        {
            byte[] actual = new byte[bytesRead];
            Buffer.BlockCopy(buffer, 0, actual, 0, bytesRead);
            outputStream.Write(actual, 0, actual.Length);
        }
    }
    Thread.Sleep(1000);
}
Just a suggestion: try disposing of the Stream and the Reader. I've seen some weird cases where not disposing of objects like these and using them in loops can yield some wacky results...
That URL doesn't quite make sense to me unless you are using MVC or something that can interpret the querystring correctly.
http://www.website.com/results.aspx&page=
should be:
http://www.website.com/results.aspx?page=
Some browsers will accept poorly formed URLs and render them fine. Others may not, which may be the problem with your console app.
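If a malformed URL is the concern, one way to rule it out is to let UriBuilder assemble the query string; a small sketch using the question's variables:
// Build the URL instead of concatenating strings by hand;
// UriBuilder adds the "?" separator itself.
var builder = new UriBuilder("http", "www.website.com", 80, "/results.aspx");
builder.Query = "page=" + strPageNo;
wrGETURL = WebRequest.Create(builder.Uri);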
Here's my terrible, hackish workaround solution:
Make another console app that calls THIS one, where the first console app passes an argument that gets appended to strURL. It works, but I feel so dirty.
