I'm making a WinForms project in C#/C++ (depending on which is the better way to reach my goal; I could switch languages). I need to get a page from a website and parse it to extract some information. I'm a complete beginner at web programming with Visual C#/C++, and all the answers I've found here are too complicated for me. Could you tell me which standard classes I should use for getting a page from the Internet, and how to parse it afterwards? I would be grateful for any code examples, because as I wrote above I have no experience with web code and no time to learn every term in detail. Thank you in advance.
You can use C# to download the specific web page and then do the analysis. A code example of downloading:
using System.IO;
using System.Net;
using System.Net.Mime;      // for ContentType
using System.Text;          // for Encoding
using System.Windows.Forms;

string result = null;
string url = "http://www.devtopics.com";
WebResponse response = null;
StreamReader reader = null;
try
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "GET";
    response = request.GetResponse();

    // Read the charset the server declared so the page is decoded correctly.
    ContentType contentType = new ContentType(response.ContentType);
    Encoding encoding = Encoding.GetEncoding(contentType.CharSet);

    reader = new StreamReader(response.GetResponseStream(), encoding);
    result = reader.ReadToEnd();
}
catch (Exception ex)
{
    // Handle the error.
    MessageBox.Show(ex.Message);
}
finally
{
    if (reader != null)
        reader.Close();
    if (response != null)
        response.Close();
}
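For the parsing step, the standard System.Text.RegularExpressions classes are enough for pulling simple values out of the downloaded HTML. A minimal sketch, assuming `result` holds the page source from the snippet above (the `<title>` tag is just an illustrative target; for anything beyond trivial extraction, a dedicated HTML parser is more robust than regular expressions):

```csharp
using System.Text.RegularExpressions;
using System.Windows.Forms;

// Assuming 'result' holds the page HTML from the download code above.
// As a simple example, extract the contents of the <title> tag.
Match match = Regex.Match(result, @"<title>\s*(.+?)\s*</title>",
                          RegexOptions.IgnoreCase | RegexOptions.Singleline);
if (match.Success)
{
    string pageTitle = match.Groups[1].Value;
    MessageBox.Show(pageTitle);
}
```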
Related
I need to use WebRequest to download webpage content into a string.
I can't use WebClient instead because it doesn't support certain HTTP headers which I need to set. I couldn't figure out the best practice for handling cleanup in this case (how to correctly dispose everything). Is a using statement enough, or do I need to add some try/catch here too?
This is my code so far:
var webRequest = (HttpWebRequest)WebRequest.Create("http://www.gooogle.com");
using (var webResponse = (HttpWebResponse)webRequest.GetResponse()) {
    using (var responseStream = webResponse.GetResponseStream()) {
        responseStream.ReadTimeout = 30000; // milliseconds; a value of 30 would be only 30 ms
        using (var streamReader = new StreamReader(responseStream, Encoding.UTF8)) {
            var page = streamReader.ReadToEnd();
        }
    }
    Console.WriteLine("Done");
}
Your code is fine (except of course that some exception handling would be nice). You don't need to worry about disposing or closing streams when you use using blocks; the compiler generates the code for that.
The best thing would of course be to wrap the above code in a function that returns the page, and put a global try/catch in there, for instance:
public string GetHtmlPage(string urlToFetch)
{
    string page = "";
    try
    {
        ... code ...
        return page;
    }
    catch (Exception exc)
    {
        throw new HtmlPageRetrievalException(exc);
    }
}
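Putting the two pieces together, a complete version of that function might look like the sketch below. (HtmlPageRetrievalException is assumed to be a custom exception type you define yourself; it is not a framework class.)

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public string GetHtmlPage(string urlToFetch)
{
    try
    {
        var webRequest = (HttpWebRequest)WebRequest.Create(urlToFetch);
        // Stacked using blocks guarantee disposal even if an exception is thrown.
        using (var webResponse = (HttpWebResponse)webRequest.GetResponse())
        using (var responseStream = webResponse.GetResponseStream())
        using (var streamReader = new StreamReader(responseStream, Encoding.UTF8))
        {
            return streamReader.ReadToEnd();
        }
    }
    catch (Exception exc)
    {
        // Wrap any network/parse failure in our own exception type.
        throw new HtmlPageRetrievalException(exc);
    }
}
```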
The page at https://qrng.physik.hu-berlin.de/ provides a high-bit-rate quantum random number generator web service, and I'm trying to access that service.
However, I haven't managed to do so. This is my current code:
using System;
using System.IO;
using System.Net;
using System.Text;
namespace CS_Console_App
{
    class Program
    {
        static void Main()
        {
            System.Net.ServicePointManager.Expect100Continue = false;
            var username = "testuser";
            var password = "testpass";
            System.Diagnostics.Debug.WriteLine(
                Post("https://qrng.physik.hu-berlin.de/",
                     "username=" + username + "&password=" + password));
            Get("http://qrng.physik.hu-berlin.de/download/sampledata-1MB.bin");
        }

        public static void Get(string url)
        {
            var my_request = System.Net.WebRequest.Create(url);
            my_request.Credentials = System.Net.CredentialCache.DefaultCredentials;
            var my_response = my_request.GetResponse();
            var my_response_stream = my_response.GetResponseStream();
            var stream_reader = new System.IO.StreamReader(my_response_stream);
            var content = stream_reader.ReadToEnd();
            System.Diagnostics.Debug.WriteLine(content);
            stream_reader.Close();
            my_response_stream.Close();
        }

        public static string Post(string url, string data)
        {
            string vystup = null;
            try
            {
                // Our POST variables.
                byte[] buffer = System.Text.Encoding.ASCII.GetBytes(data);
                System.Net.HttpWebRequest WebReq =
                    (System.Net.HttpWebRequest)System.Net.WebRequest.Create(url);
                // The method must be POST, otherwise the buffer (postvars) would be useless.
                WebReq.Method = "POST";
                // Form content type, for the POST variables.
                WebReq.ContentType = "application/x-www-form-urlencoded";
                // The length of the buffer is used as the content length.
                WebReq.ContentLength = buffer.Length;
                // Open a stream for writing the POST variables; closing is always important!
                Stream PostData = WebReq.GetRequestStream();
                PostData.Write(buffer, 0, buffer.Length);
                PostData.Close();
                // Get the response handle; we have no true response yet!
                System.Net.HttpWebResponse WebResp =
                    (System.Net.HttpWebResponse)WebReq.GetResponse();
                // Show some information about the response.
                Console.WriteLine(WebResp.StatusCode);
                Console.WriteLine(WebResp.Server);
                // Read the response body and return it.
                using (Stream Answer = WebResp.GetResponseStream())
                using (StreamReader _Answer = new StreamReader(Answer))
                {
                    vystup = _Answer.ReadToEnd();
                }
            }
            catch
            {
                // Rethrow; "throw ex;" would reset the stack trace.
                throw;
            }
            return vystup.Trim() + "\n";
        }
    }
}
I'm getting a 403 Forbidden error when I try to do a GET request on http://qrng.physik.hu-berlin.de/download/sampledata-1MB.bin.
After debugging a bit, I realised that even though I supplied a valid username and password, the HTML returned after my POST request indicates that I was not actually logged on to the system.
Does anyone know why this is the case, and how I might work around it to call the service?
Bump. Can anyone get this to work, or is the site just a scam?
The site is surely not a scam. I developed the generator and I put my scientific reputation in it. The problem is that you are trying to use the service in a way that was not intended. The sample files were really only meant to be downloaded manually for basic test purposes. Automated access to fetch data into an application was meant to be implemented through the DLLs we provide.
On the other hand, I do not know of any explicit intent to prevent your implementation to work. I suppose if a web browser can log in and fetch data, some program should be able to do the same. Maybe only the login request is just a little more complicated. No idea. The server software was developed by someone else and I cannot bother him with this right now.
Mick
Actually, the generator can now also be purchased. See here:
http://www.picoquant.com/products/pqrng150/pqrng150.htm
Have you tried changing this
my_request.Credentials = System.Net.CredentialCache.DefaultCredentials
to
my_request.Credentials = new NetworkCredential(UserName,Password);
as described on the MSDN page?
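One common cause of the behavior described in the question is that the session cookie set by the login POST is never sent with the later GET, so each request looks like a fresh, unauthenticated visitor. If the site uses cookie-based sessions, sharing a CookieContainer between both requests may help. This is only a sketch under that assumption; the URLs and form field names are taken from the question and may not match what the server actually expects:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Share one cookie container across the login POST and the download GET,
// so the session cookie issued at login is replayed on the second request.
var cookies = new CookieContainer();

// 1. Log in; any Set-Cookie from the server lands in 'cookies'.
var loginRequest = (HttpWebRequest)WebRequest.Create("https://qrng.physik.hu-berlin.de/");
loginRequest.Method = "POST";
loginRequest.ContentType = "application/x-www-form-urlencoded";
loginRequest.CookieContainer = cookies;
byte[] body = Encoding.ASCII.GetBytes("username=testuser&password=testpass");
loginRequest.ContentLength = body.Length;
using (Stream s = loginRequest.GetRequestStream())
    s.Write(body, 0, body.Length);
using (var loginResponse = (HttpWebResponse)loginRequest.GetResponse()) { }

// 2. Reuse the same cookies for the download.
var fileRequest = (HttpWebRequest)WebRequest.Create(
    "http://qrng.physik.hu-berlin.de/download/sampledata-1MB.bin");
fileRequest.CookieContainer = cookies;
using (var fileResponse = (HttpWebResponse)fileRequest.GetResponse())
using (var responseStream = fileResponse.GetResponseStream())
using (var file = File.Create("sampledata-1MB.bin"))
{
    responseStream.CopyTo(file);
}
```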
Right now I am using this code:
string url = "http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=hey&esrch=FT1";
string source = getPageSource(url);
string[] stringSeparators = new string[] { "<b>", "</b>" };
string[] b = source.Split(stringSeparators, StringSplitOptions.None);
bool isResultNum = false;
foreach (string s in b) {
    if (isResultNum) {
        MessageBox.Show(s.Replace(",", ""));
        return;
    }
    if (s.Contains(" of about ")) {
        isResultNum = true;
    }
}
Unfortunately it is very slow; is there a better way to do it? Also, is it legal to query Google like this? From the answer to this question it didn't sound like it: How to download Google search results?
You already referenced the post mentioning the transition from the SOAP API to AJAX.
The RESTful interface should give you what you need since it limits the returned results sets but gives you estimatedResultCount and doesn't seem to raise any legal issues (as of now).
Update
I followed the link from Google's API page to www.json.org and found a link to this library on SourceForge. I have not tried it out myself yet, but I think it will be helpful for you.
Update 2
It looks like Json.NET offers better support than csjson.
Json.NET sample
...
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(googleUri);
request.Referer = "http://www.your-referer.com";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream responsestream = response.GetResponseStream();
StreamReader responsereader = new StreamReader(responsestream);
JObject jo = JObject.Parse(responsereader.ReadToEnd());
int resultcount = (int)jo.SelectToken("responseData.cursor.estimatedResultCount");
...
I need to check if a text file exists on a site on a different domain. The URL could be:
http://sub.somedomain.com/blah/atextfile.txt
I need to do this from code behind. I am trying to use the HttpWebRequest object, but not sure how to do it.
EDIT: I am looking for a lightweight way of doing this, as I'll be executing this logic every few seconds
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
    "http://sub.somedomain.com/blah/atextfile.txt");
try
{
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        if (response.StatusCode == HttpStatusCode.OK)
        {
            // FILE EXISTS!
        }
    }
}
catch (WebException)
{
    // A 404 (and most other HTTP errors) surfaces as a WebException,
    // so a missing file lands here rather than in the status-code check.
}
You could probably use the method used here:
http://www.eggheadcafe.com/tutorials/aspnet/2c13cafc-be1c-4dd8-9129-f82f59991517/the-lowly-http-head-reque.aspx
Something like this might work for you:
using (WebClient webClient = new WebClient())
{
    try
    {
        using (Stream stream = webClient.OpenRead("http://does.not.exist.com/textfile.txt"))
        {
            // If we get here, the file exists.
        }
    }
    catch (WebException)
    {
        // The file does not exist (or the server refused the request).
    }
}
In my application I use the WebClient class to download files from a Webserver by simply calling the DownloadFile method. Now I need to check whether a certain file exists prior to downloading it (or in case I just want to make sure that it exists). I've got two questions with that:
What is the best way to check whether a file exists on a server without transferring too much data across the wire? (It's quite a large number of files I need to check.)
Is there a way to get the size of a given remote file without downloading it?
Thanks in advance!
WebClient is fairly limited; if you switch to using WebRequest, then you gain the ability to send an HTTP HEAD request. When you issue the request, you should either get an error (if the file is missing), or a WebResponse with a valid ContentLength property.
Edit: Example code:
WebRequest request = WebRequest.Create(new Uri("http://www.example.com/"));
request.Method = "HEAD";
using(WebResponse response = request.GetResponse()) {
Console.WriteLine("{0} {1}", response.ContentLength, response.ContentType);
}
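Wrapped into helpers, a HEAD request can answer both questions (does the file exist, and how big is it?) without pulling the file body across the wire. This is a sketch; note that a server which disallows HEAD requests can return an error even though the file exists:

```csharp
using System;
using System.Net;

public static class RemoteFileProbe
{
    // Returns the remote file's size in bytes, or -1 if the request fails
    // (missing file, no permission, or a server that refuses HEAD).
    public static long GetRemoteFileSize(string url)
    {
        try
        {
            WebRequest request = WebRequest.Create(new Uri(url));
            request.Method = "HEAD"; // headers only, no body transferred
            using (WebResponse response = request.GetResponse())
            {
                return response.ContentLength;
            }
        }
        catch (WebException)
        {
            return -1;
        }
    }

    public static bool RemoteFileExists(string url)
    {
        return GetRemoteFileSize(url) >= 0;
    }
}
```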
When you request a file using the WebClient class, a 404 error (File Not Found) will cause an exception. The best way is to handle that exception and set a flag recording whether the file exists or not.
The example code goes as follows:
System.Net.HttpWebRequest request = null;
System.Net.HttpWebResponse response = null;
int flag = 0;
request = (System.Net.HttpWebRequest)System.Net.WebRequest.Create("http://www.example.com/somepath");
request.Timeout = 30000;
try
{
    response = (System.Net.HttpWebResponse)request.GetResponse();
    flag = 1;
}
catch
{
    flag = -1;
}
if (flag == 1)
{
    Console.WriteLine("File Found!!!");
}
else
{
    Console.WriteLine("File Not Found!!!");
}
You can put your code in respective if blocks.
Hope it helps!
What is the best way to check whether a file exists on a server
without transfering to much data across the wire?
You can test with WebClient.OpenRead to open the file stream without reading all the file bytes:
using (var client = new WebClient())
{
Stream stream = client.OpenRead(url);
// ^ throws System.Net.WebException: 'Could not find file...' if file is not present
stream.Close();
}
This will indicate if the file exists at the remote location or not.
To fully read the file stream, you would do:
using (var client = new WebClient())
using (Stream stream = client.OpenRead(url))
using (StreamReader sr = new StreamReader(stream))
{
    Console.WriteLine(sr.ReadToEnd());
}
In case anyone is stuck with an SSL certificate issue (note that the callback below accepts every certificate, which disables validation entirely, so only use it for testing):
ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback
(
delegate { return true; }
);
WebRequest request = WebRequest.Create(new Uri("http://.com/flower.zip"));
request.Method = "HEAD";
using (WebResponse response = request.GetResponse())
{
Console.WriteLine("{0} {1}", response.ContentLength, response.ContentType);
}