I am trying to get the content length of a web page, for example http://www.google.com.
I am using C# and below is the code I used. It does not give me the correct length, or does it? Can someone validate, please?
var request = (HttpWebRequest)WebRequest.Create("http://www.google.com.au");
request.Method = "GET";
var myResponse = request.GetResponse();
var responseLength = myResponse.ContentLength;
using (var sr = new StreamReader(myResponse.GetResponseStream(), System.Text.Encoding.UTF8))
{
    var result = sr.ReadToEnd();
    myResponse.Close();
}
responseLength is always -1 but result.Length has some value, is that correct?
responseLength is always -1 but result.Length has some value, is that correct?
Well it may be for some web sites (or some responses in some web sites) - in other cases, you'll see a non-negative value for responseLength. All you're doing is fetching the optional Content-Length HTTP header, basically... it's up to the server whether it supplies that or not.
Note that the response length, if provided, will be in bytes - whereas result.Length is in UTF-16 code units. If you want the content length in bytes, you should be reading the binary data from the stream directly rather than creating a StreamReader and reading it as text.
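For example, here is a minimal sketch of counting the bytes by reading the raw response stream directly (reusing the URL from the question; the usual System, System.IO, and System.Net usings are assumed):
var request = (HttpWebRequest)WebRequest.Create("http://www.google.com.au");
request.Method = "GET";

using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
{
    var buffer = new byte[8192];
    long totalBytes = 0;
    int bytesRead;
    // count raw bytes instead of decoding the response as text
    while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        totalBytes += bytesRead;
    }
    Console.WriteLine("Content length in bytes: {0}", totalBytes);
}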
I think you want to DownloadString and then check the length.
Console.WriteLine(new WebClient().DownloadString("https://google.com/").Length);
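Note that the same caveat applies here: Length counts characters in the decoded string, not bytes, so it won't necessarily match the Content-Length header.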
I need to read only the mode segment from the below request body.
grant_type=password&username=demouser&password=test123&client_id=500DWCSFS-D3C0-4135-A188-17894BABBCCF&mode=device
I used the code below to read the HTTP body, and it gives me the entire body. How can I pull out just the mode segment without using Substring or changing the value in Seek(): bodyStream.BaseStream.Seek(3, SeekOrigin.Begin)?
var bodyStream = new StreamReader(HttpContext.Current.Request.InputStream);
bodyStream.BaseStream.Seek(0, SeekOrigin.Begin);
var bodyText = bodyStream.ReadToEnd();
You can't. HTTP runs over TCP, and you can't "seek" into a TCP stream; you have to read the entire body anyway. Well, you can seek, but that still reads the entire body and discards the unused pieces.
So you have to read the entire stream, and you have to parse it meaningfully, because another parameter's value could also contain the string "mode", and it could also be at the start, so you can't simply search for &mode either.
Given this is a form post, you can simply access Request.Form["mode"]. If you do want to parse it yourself:
string formData;
using (var reader = new StreamReader(HttpContext.Current.Request.InputStream))
{
    formData = reader.ReadToEnd();
}
var queryString = HttpUtility.ParseQueryString(formData);
var mode = queryString["mode"];
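With the request body shown above, queryString["mode"] will contain "device".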
My aim is to get content from a website (for instance a league table from a sports website) and put it in a .txt file so that I can code with a local file.
I have tried multiple lines of code and other examples, such as:
// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.stackoverflow.com");
// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.stackoverflow.com");
// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// we will read data via the response stream
Stream resStream = response.GetResponseStream();
string tempString = null;
int count = 0;
do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);
    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text
        tempString = Encoding.ASCII.GetString(buf, 0, count);
        // continue building the string
        sb.Append(tempString);
    }
} while (count > 0); // any more data to read?
My issue when trying this is that the words request and response are underlined in red and all the tokens are invalid.
Is there a better method to get content from a website to a .txt file or is there a way to fix the code supplied?
Thanks
is there a way to fix the code supplied?
The code you submitted works for me; make sure you have the proper namespaces defined.
In this case: using System.Net;
Or might it be that the duplicate creation of the variable request isn't a typo?
If so, remove one of the request variables.
Is there a better method to get content from a website to a .txt file
Since you're reading all the content from the site anyway, there isn't really a need for the while loop. Instead you can use the ReadToEnd method supplied by the StreamReader.
string siteContent = "";
using (StreamReader reader = new StreamReader(resStream)) {
    siteContent = reader.ReadToEnd();
}
Also be sure to dispose of the WebResponse; other than that, your code should work fine.
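Putting it together, here is a minimal sketch of fetching a page and writing it to a .txt file (the output file name is a placeholder):
using System;
using System.IO;
using System.Net;

class Program
{
    static void Main()
    {
        // prepare and execute the request
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.stackoverflow.com");

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            // read the whole response and save it to a local text file
            string siteContent = reader.ReadToEnd();
            File.WriteAllText("site.txt", siteContent);
        }
    }
}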
I'm trying to create a website crawler. It's going to retrieve some data from many websites.
Sometimes, if I load just the first 1000 bytes of a webpage, I can see what I am looking for.
Here is my code:
var request = (HttpWebRequest)WebRequest.Create("http://example.com");
var response = (HttpWebResponse)request.GetResponse();
string responseString = new StreamReader(response.GetResponseStream()).ReadToEnd();
When I call request.GetResponse(), it loads the whole page (for example 4000 bytes), but the data I'm looking for is in the first 1000 bytes. And when I call ReadToEnd(), it just reads the already-received data from RAM; the whole page has still been sent to my computer by the website! I don't want to receive all the bytes, only the first N bytes.
If I can do that, I will save a lot of internet traffic.
Can you help me? How can I do that?
Use StreamReader.Read, e.g.
StreamReader sr = new StreamReader(response.GetResponseStream());
char[] c = new char[1000]; // 1000 characters, not bytes
int charsRead = sr.Read(c, 0, c.Length); // may read fewer than requested
string responseString = new string(c, 0, charsRead);
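Note that this limits the number of decoded characters you keep, not necessarily the number of bytes transferred; depending on the response encoding, 1000 characters may correspond to more than 1000 bytes.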
I'm working in C# on a program to list all course resources for a MOOC (e.g. Coursera). I don't want to download the content, just get a listing of all the resources (e.g. pdf, videos, text files, sample files, etc...) which are made available to the course.
My problem lies in parsing the html source (currently using HtmlAgilityPack) without downloading all the content.
For example, if you go to this intro video for a banking course on Coursera and check the source (F12 in Chrome for Developer Tools), you can see the page source. I can stop the video download which autoplays, but still see the source.
How can I get the source in C# without downloading all the content?
I've looked into the HttpWebRequest headers (problem: timeout) and DownloadDataAsync with Cancel (problem: the Completed Result object is invalid when cancelling the async request). I've also tried various Loads from HtmlAgilityPack, but with no success.
Time out:
HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create(url);
postRequest.Timeout = TIMEOUT * 1000000; //Really long
postRequest.Referer = "https://www.coursera.org";
if (headers != null)
{ //headers here }
//Deal with cookies
if (cookie != null)
{ cookieJar.Add(cookie); }
postRequest.CookieContainer = cookieJar;
postRequest.Method = "GET";
postRequest.AllowAutoRedirect = allowRedirect;
postRequest.ServicePoint.Expect100Continue = true;
HttpWebResponse postResponse = (HttpWebResponse)postRequest.GetResponse();
Any tips on how to proceed?
There are at least two ways to do what you're asking. The first is to use a range get. That is, specify the range of the file you want to read. You do that by calling AddRange on the HttpWebRequest. So if you want, say, the first 10 kilobytes of the file, you'd write:
request.AddRange(-10240);
Read carefully what the documentation says about the meaning of that parameter. If it's negative, it specifies the ending point of the range. There are also other overloads of AddRange that you might be interested in.
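For instance, here is a minimal sketch using the two-argument overload to request an explicit byte range; a server that honors the range responds with 206 Partial Content (the URL here is just a placeholder):
var request = (HttpWebRequest)WebRequest.Create("http://example.com/file");
request.AddRange(0, 10239); // request bytes 0 through 10239, i.e. the first 10 KB

using (var response = (HttpWebResponse)request.GetResponse())
{
    // 206 means the server honored the range; 200 means it sent everything
    if (response.StatusCode == HttpStatusCode.PartialContent)
    {
        Console.WriteLine("Server supports range gets");
    }
}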
Not all servers support range gets, though. If that doesn't work, you'll have to do it another way.
What you can do is call GetResponse and then start reading data. Once you've read as much data as you want, you can stop reading and close the stream. I've modified your sample slightly to show what I mean.
string url = "https://www.coursera.org/course/money";
HttpWebRequest postRequest = (HttpWebRequest)WebRequest.Create(url);
postRequest.Method = "GET";
postRequest.AllowAutoRedirect = true; //allowRedirect;
postRequest.ServicePoint.Expect100Continue = true;
HttpWebResponse postResponse = (HttpWebResponse) postRequest.GetResponse();
int maxBytes = 1024*1024;
int totalBytesRead = 0;
var buffer = new byte[maxBytes];
using (var s = postResponse.GetResponseStream())
{
    int bytesRead;
    // read up to `maxBytes` bytes from the response,
    // accumulating them into `buffer`
    while (totalBytesRead < maxBytes &&
           (bytesRead = s.Read(buffer, totalBytesRead, maxBytes - totalBytesRead)) != 0)
    {
        // `buffer` now holds everything read so far; you could also
        // write each chunk to a file here
        Console.WriteLine("{0:N0} bytes read", bytesRead);
        totalBytesRead += bytesRead;
    }
}
Console.WriteLine("total bytes read = {0:N0}", totalBytesRead);
That said, I ran this sample and it downloaded about 6 kilobytes and stopped. I don't know why you're having trouble with timeouts or too much data.
Note that sometimes trying to close the stream before the entire response is read will cause the program to hang. I'm not sure why that happens at all, and I can't explain why it only happens sometimes. But you can solve it by calling request.Abort before closing the stream. That is:
using (var s = postResponse.GetResponseStream())
{
    // do stuff here

    // abort the request before continuing
    postRequest.Abort();
}
I'm trying to obtain an image to encode to a WordML document. The original version of this function used files, but I needed to change it to get images created on the fly with an aspx page. I've adapted the code to use HttpWebRequest instead of a WebClient. The problem is that I don't think the page request is getting resolved and so the image stream is invalid, generating the error "parameter is not valid" when I invoke Image.FromStream.
public string RenderCitationTableImage(string citation_table_id)
{
    string image_content = "";
    string _strBaseURL = String.Format("http://{0}",
        HttpContext.Current.Request.Url.GetComponents(UriComponents.HostAndPort, UriFormat.Unescaped));
    string _strPageURL = String.Format("{0}{1}", _strBaseURL,
        ResolveUrl("~/Publication/render_citation_chart.aspx"));
    string _staticURL = String.Format("{0}{1}", _strBaseURL,
        ResolveUrl("~/Images/table.gif"));
    string _fullURL = String.Format("{0}?publication_id={1}&citation_table_layout_id={2}",
        _strPageURL, publication_id, citation_table_id);
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(_fullURL);
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        Stream image_stream = response.GetResponseStream();
        // Read the image data
        MemoryStream ms = new MemoryStream();
        int num_read;
        byte[] crlf = System.Text.Encoding.Default.GetBytes("\r\n");
        byte[] buffer = new byte[1024];
        for (num_read = image_stream.Read(buffer, 0, 1024); num_read > 0; num_read = image_stream.Read(buffer, 0, 1024))
        {
            ms.Write(buffer, 0, num_read);
        }
        // Base 64 Encode the image data
        byte[] image_bytes = ms.ToArray();
        string encodedImage = Convert.ToBase64String(image_bytes);
        ms.Position = 0;
        System.Drawing.Image image_original = System.Drawing.Image.FromStream(ms); // <---error here: parameter is not valid
        image_stream.Close();
        image_content = string.Format("<w:p>{4}<w:r><w:pict><w:binData w:name=\"wordml://{0}\">{1}</w:binData>" +
            "<v:shape style=\"width:{2}px;height:{3}px\">" +
            "<v:imagedata src=\"wordml://{0}\"/>" +
            "</v:shape>" +
            "</w:pict></w:r></w:p>", _word_image_id, encodedImage, 800, 400, alignment.center);
        image_content = "<w:br w:type=\"text-wrapping\"/>" + image_content + "<w:br w:type=\"text-wrapping\"/>";
    }
    catch (Exception ex)
    {
        return ex.ToString();
    }
    return image_content;
}
Using a static URI it works fine. If I replace "staticURL" with "fullURL" in the WebRequest.Create method I get the error. Any ideas as to why the page request doesn't fully resolve?
And yes, the full URL resolves fine and shows an image if I post it in the address bar.
UPDATE:
Just read your updated question. Since you're running into login issues, try doing this before you execute the request:
request.Credentials = CredentialCache.DefaultCredentials;
If this doesn't work, then perhaps the problem is that authentication is not being enforced on static files but is being enforced on dynamic pages. In that case, you'll need to log in first (using your client code) and retain the login cookie (using HttpWebRequest.CookieContainer on the login request as well as on the second request), or turn off authentication on the page you're trying to access.
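For example, a minimal sketch of sharing one CookieContainer across both requests (the login URL is a placeholder; the actual login mechanics depend on your site):
var cookieJar = new CookieContainer();

// log in first so the server can put its auth cookie into cookieJar
var loginRequest = (HttpWebRequest)WebRequest.Create("http://yoursite/login.aspx");
loginRequest.CookieContainer = cookieJar;
using (loginRequest.GetResponse()) { }

// reuse the same container so the auth cookie accompanies the image request
var imageRequest = (HttpWebRequest)WebRequest.Create(_fullURL);
imageRequest.CookieContainer = cookieJar;
using (var imageResponse = (HttpWebResponse)imageRequest.GetResponse())
using (var imageStream = imageResponse.GetResponseStream())
{
    // read the image data as in the original code
}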
ORIGINAL:
Since it works with one HTTP URL and doesn't work with another, the place to start diagnosing this is figuring out what's different between the two requests, at the HTTP level, which accounts for the difference in behavior in your code.
To figure out the difference, I'd use Fiddler (http://fiddlertool.com) to compare the two requests. Compare the HTTP headers. Are they the same? In particular, are they the same HTTP content type? If not, that's likely the source of your problem.
If headers are the same, make sure both the static and dynamic image are exactly the same content and file type on the server. (e.g. use File...Save As to save the image in a browser to your disk). Then use Fiddler's Hex View to compare the image content. Can you see any obvious differences?
Finally, I'm sure you've already checked this, but just making sure: /Publication/render_citation_chart.aspx refers to an actual image file, not an HTML wrapper around an IMG element, right? This would account for the behavior you're seeing, where a browser renders the image OK but your code doesn't.