How to get full webpage HTML in C#?

I am attempting to make a web scraper to collect news articles, but I am having trouble obtaining the full HTML content of the webpage. Here is the URL that I initially need to scrape for article search results:
Then, I scrape each individual article (example).
I have tried using WebRequest, HttpWebRequest, and WebClient to make my requests, but the result returned each time contains only the HTML for the sidebar and similar parts of the page. Comparing against Chrome developer tools, the returned HTML begins just after where the main content of the page should be, and is therefore unhelpful. I have also looked for AJAX calls for the content and have not been able to find any.
I have successfully been able to scrape the needed content using Selenium WebDriver, but this is not ideal, as it is much slower to visit every URL and it often gets hung up loading pages. Any help with requesting the full HTML contents of the page would be greatly appreciated.

I am not sure what issue you are having, but here is how I accomplished your task.
First I viewed the page in my web browser with the Network tab open in the developer tools.
From there I collected the list of headers my real browser sent. I then constructed an HttpWebRequest with those same headers and was able to retrieve the full HTML of the page:
public string getHtml()
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.fa-mag.com/search.php?and_or=and&date_range=all&magazine=&sort=newest&method=basic&query=ubs");

    // Mirror the headers a real browser sends; the site's bot protection checks for them.
    req.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0";
    req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    req.AllowAutoRedirect = false;
    req.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.5");
    // Cookie captured from a real browser session.
    req.Headers.Add("cookie", "analytics_id=595127c20cdfe6.52043028595127c20ce022.71834842; PHPSESSID=tbbo7npldsv26n2q7pg2728k77; D_IID=3E4FEA7F-9794-34EE-99F8-87EEA3DF0689; D_UID=5F374D94-270D-3653-8C54-9A46F381EAE2; D_ZID=505BB8EF-5A2D-3CBD-87D8-FABAD5014776; D_ZUID=BB0C9EF2-0E7B-383E-A03A-A3E92CC8051A; D_HID=9642D775-D860-3F04-8720-73E5339042BA; D_SID=63.138.127.22:6Ci6jv2Xv+yum3m9lNfnyRcAylne67YfnS/u8goKrxQ");
    req.Headers.Add("DNT", "1");
    req.Headers.Add("Upgrade-Insecure-Requests", "1");

    HttpWebResponse res = null;
    try
    {
        res = (HttpWebResponse)req.GetResponse();
    }
    catch (WebException webex)
    {
        // Non-2xx responses throw; the body is still available on the exception's response.
        res = (HttpWebResponse)webex.Response;
    }

    using (var reader = new StreamReader(res.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}
Without the custom headers, bot protection on the page sends a 416 response and redirects. If you read the HTML of the redirect page, it states that the site has detected you as a bot.
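For what it's worth, here is a minimal sketch of the same approach using HttpClient. The headers mirror the request above; the method name GetHtmlAsync and the cookie placeholder are my own, and the Cookie value has to be copied from your own browser session because it expires.
// Minimal sketch: same idea with HttpClient (requires System.Net.Http and System.Threading.Tasks).
public static async Task<string> GetHtmlAsync()
{
    var handler = new HttpClientHandler { AllowAutoRedirect = false };
    using (var client = new HttpClient(handler))
    {
        var request = new HttpRequestMessage(HttpMethod.Get,
            "http://www.fa-mag.com/search.php?and_or=and&date_range=all&magazine=&sort=newest&method=basic&query=ubs");
        // Same browser-like headers as the HttpWebRequest version above.
        request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0");
        request.Headers.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        request.Headers.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.5");
        request.Headers.TryAddWithoutValidation("DNT", "1");
        request.Headers.TryAddWithoutValidation("Upgrade-Insecure-Requests", "1");
        // Placeholder: paste the Cookie header from your own browser session here.
        request.Headers.TryAddWithoutValidation("Cookie", "<cookie string from your browser>");

        var response = await client.SendAsync(request);
        return await response.Content.ReadAsStringAsync();
    }
}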

Related

How to parse full text webpage from this page?

I need to get the track names from this page, but I get an incomplete response:
var response = await client.GetStringAsync(new Uri("http://parmismedia1.com/musicplayeralbum.aspx?album=666&amp;id=8503&amp;title=farzad-farzin-6-to-che-bashi"));
I used the Firefox inspector, sent a POST request, and tried mobile and desktop user-agent strings, but still got an incomplete response.
However, I noticed that I get the full page text when I create a download task in UC Browser with that address.
How can I get the complete page text?
In a test app, I used a valid URL (not using &amp; in the request, but an ampersand & directly), and the response returns correctly:
var client = new HttpClient();
var response = await client.GetStringAsync(new Uri("https://parmismedia1.com/musicplayeralbum.aspx?album=666&id=8503&title=farzad-farzin-6-to-che-bashi"));
That being said, your original query also returns successfully, it's just doing several redirects before it fully returns.
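If you want to see what those redirects are actually doing, one rough option (a sketch only, reusing the URL from above, inside an async method) is to turn off automatic redirects and follow them yourself:
// Rough sketch: disable automatic redirects so every intermediate Location header is visible.
var handler = new HttpClientHandler { AllowAutoRedirect = false };
using (var client = new HttpClient(handler))
{
    var url = "https://parmismedia1.com/musicplayeralbum.aspx?album=666&id=8503&title=farzad-farzin-6-to-che-bashi";
    HttpResponseMessage response;
    while (true)
    {
        response = await client.GetAsync(url);
        int status = (int)response.StatusCode;
        if (status < 300 || status >= 400 || response.Headers.Location == null)
            break;
        Console.WriteLine(status + " -> " + response.Headers.Location);
        // Location may be relative, so resolve it against the current URL before following it.
        url = new Uri(new Uri(url), response.Headers.Location).ToString();
    }
    string html = await response.Content.ReadAsStringAsync();
}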
But, I did notice that the HTML page being returned isn't entirely valid as it contains error information at the beginning of the response:
The process cannot access the file 'C:\inetpub\PMWebsite\Log\500_2016-04-03.log' because it is being used by another process.
<!DOCTYPE html>
<html lang="en" class="app">
<head><meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" /><title>
You might want to check with the creators of the web site whether scraping their web application content is OK, and whether they have a direct API you could use instead.
I'm still figuring out why client.GetStringAsync is not working, but I was able to get the page HTML by using System.Net.HttpWebRequest.
Code sample below.
Uri address = new Uri("http://parmismedia1.com/musicplayeralbum.aspx?album=666&id=8503&title=farzad-farzin-6-to-che-bashi");
HttpWebRequest httpRequest = WebRequest.Create(address) as HttpWebRequest;
httpRequest.UseDefaultCredentials = true;
httpRequest.ServicePoint.Expect100Continue = false;
httpRequest.Proxy.Credentials = CredentialCache.DefaultCredentials;
httpRequest.ProtocolVersion = HttpVersion.Version11;
httpRequest.UserAgent = @"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0";
httpRequest.Method = "GET";
httpRequest.Timeout = 3000; // milliseconds
HttpWebResponse response = httpRequest.GetResponse() as HttpWebResponse;
StreamReader reader = new StreamReader(response.GetResponseStream());
string html = reader.ReadToEnd();
response.Close();

Querystring being ignored

I'm writing an interface to scrape info from a service. The link is behind a login, so I keep a copy of the cookies and then attempt to loop through the pages to get stats for our users.
The urls to hit are of the format: https://domain.com/groups/members/1234
for the first page, and each subsequent page appends ?page=X
string vUrl = "https://domain.com/groups/members/1234";
if (pageNumber > 1) vUrl += "?page=" + (pageNumber).ToString();
HttpWebRequest groupsRequest = (HttpWebRequest)WebRequest.Create(vUrl);
groupsRequest.CookieContainer = new CookieContainer();
groupsRequest.CookieContainer.Add(cookies); //recover cookies First request
groupsRequest.Method = WebRequestMethods.Http.Get;
groupsRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36";
groupsRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
groupsRequest.UseDefaultCredentials = true;
groupsRequest.AutomaticDecompression = DecompressionMethods.GZip;
groupsRequest.Headers.Add("Accept-Language", "en-US,en;q=0.8");
groupsRequest.Headers.Add("Cache-Control", "max-age=0");
HttpWebResponse getResponse = (HttpWebResponse)groupsRequest.GetResponse();
This works fine for the first page and I get back the data that I need, but with each subsequent pass the query string is ignored. Debugging at the last line shows that RequestUri.Query for the request is correct, but the query on the response's URI is blank. So it has the effect of always returning page 1 data.
I've tried to mimic the request headers that I see via Inspect in Chrome, but I'm stuck. Help?
When you put the URL that is failing into a browser, does it work? Because it is a GET, the browser should make the same request and tell you whether it is working. If it does not work in the browser, then perhaps you are missing something aside from the query string.
If it does work, then use Fiddler and find out exactly what headers, cookies, and query string values are being sent, to be 100% sure you are sending the correct request. It could be that the query string alone is not enough information to get the data you need.
If you still can't get it working, capture the request in Fiddler when you send it through the browser, and then use this Fiddler extension to turn the request into code and see what's up.
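As a quick first check before reaching for Fiddler, it can also help to log what the server says it actually served. The sketch below reuses groupsRequest and getResponse from the question; a missing query string on the response URI would point to a redirect dropping it.
// Diagnostic sketch: compare the URI you requested with the URI the server actually answered.
HttpWebResponse getResponse = (HttpWebResponse)groupsRequest.GetResponse();
Console.WriteLine("Requested: " + groupsRequest.RequestUri);   // should include ?page=X
Console.WriteLine("Answered:  " + getResponse.ResponseUri);    // no query string here suggests a redirect dropped it
Console.WriteLine("Status:    " + (int)getResponse.StatusCode);
foreach (string key in getResponse.Headers.AllKeys)
    Console.WriteLine(key + ": " + getResponse.Headers[key]);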

Using C# HttpClient to login on a website and scrape information from another page

I am trying to use C# and Chrome Web Inspector to login on http://www.morningstar.com and retrieve some information on the page http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US.
I do not quite understand the mental process one must use to interpret the information from the Web Inspector in order to simulate a login, keep the session alive, and navigate to the next page to collect information.
Can someone explain or point me to a resource?
For now, I have only some code to get the content of the home page and the login page:
public class Morningstar
{
    public async static void Ru4n()
    {
        var url = "http://www.morningstar.com/";
        var httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");
        var response = await httpClient.GetAsync(new Uri(url));
        response.EnsureSuccessStatusCode();
        using (var responseStream = await response.Content.ReadAsStreamAsync())
        using (var decompressedStream = new GZipStream(responseStream, CompressionMode.Decompress))
        using (var streamReader = new StreamReader(decompressedStream))
        {
            //Console.WriteLine(streamReader.ReadToEnd());
        }
        var loginURL = "https://members.morningstar.com/memberservice/login.aspx";
        response = await httpClient.GetAsync(new Uri(loginURL));
        response.EnsureSuccessStatusCode();
        using (var responseStream = await response.Content.ReadAsStreamAsync())
        using (var streamReader = new StreamReader(responseStream))
        {
            Console.WriteLine(streamReader.ReadToEnd());
        }
    }
}
EDIT: In the end, on the advice of Muhammed, I used the following piece of code:
ScrapingBrowser browser = new ScrapingBrowser();
//set UseDefaultCookiesParser as false if a website returns invalid cookies format
//browser.UseDefaultCookiesParser = false;
WebPage homePage = browser.NavigateToPage(new Uri("https://members.morningstar.com/memberservice/login.aspx"));
PageWebForm form = homePage.FindFormById("memberLoginForm");
form["email_textbox"] = "example#example.com";
form["pwd_textbox"] = "password";
form["go_button.x"] = "57";
form["go_button.y"] = "22";
form.Method = HttpVerb.Post;
WebPage resultsPage = form.Submit();
You should simulate the login process of the web site. The easiest way to do this is to inspect the site's traffic with a debugging proxy (for example, Fiddler).
Here is the login request of the web site:
POST https://members.morningstar.com/memberservice/login.aspx?CustId=&CType=&CName=&RememberMe=true&CookieTime= HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Referer: https://members.morningstar.com/memberservice/login.aspx
** omitted **
Cookie: cookies=true; TestCookieExist=Exist; fp=001140581745182496; __utma=172984700.91600904.1405817457.1405817457.1405817457.1; __utmb=172984700.8.10.1405817457; __utmz=172984700.1405817457.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=172984700; ASP.NET_SessionId=b5bpepm3pftgoz55to3ql4me
email_textbox=test@email.com&pwd_textbox=password&remember=on&email_textbox2=&go_button.x=36&go_button.y=16&__LASTFOCUS=&__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=omitted&__EVENTVALIDATION=omitted
When you inspect this, you'll see some cookies and form fields like "__VIEWSTATE". You'll need the actual values of these fields to log in. You can use the following steps:
Make a request and scrape fields like "__LASTFOCUS", "__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE", "__EVENTVALIDATION", and the cookies.
Create a new POST request to the same page; use the CookieContainer from the previous one, and build a post string using the scraped fields, the username, and the password. Post it with MIME type application/x-www-form-urlencoded.
If successful, use the cookies for further requests to stay logged in.
Note: You can use HtmlAgilityPack or ScrapySharp to scrape HTML. ScrapySharp provides easy-to-use tools for posting forms and browsing websites.
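If you'd rather wire those steps up by hand instead of using ScrapySharp, a rough sketch with HttpClient and HtmlAgilityPack could look like the following. The form field names (email_textbox, pwd_textbox, go_button.x, ...) are the ones visible in the captured request above; everything else is an assumption to verify against the live login form.
// Sketch: GET the login page, scrape the ASP.NET hidden fields, then POST them back with the credentials.
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class LoginSketch
{
    static async Task<HttpClient> LoginAsync(string user, string password)
    {
        var cookies = new CookieContainer();
        var client = new HttpClient(new HttpClientHandler { CookieContainer = cookies });

        const string loginUrl = "https://members.morningstar.com/memberservice/login.aspx";
        var doc = new HtmlDocument();
        doc.LoadHtml(await client.GetStringAsync(loginUrl));

        // Helper to pull a hidden input's value by name.
        string Hidden(string name) =>
            doc.DocumentNode.SelectSingleNode($"//input[@name='{name}']")?.GetAttributeValue("value", "") ?? "";

        var form = new Dictionary<string, string>
        {
            ["__VIEWSTATE"] = Hidden("__VIEWSTATE"),
            ["__EVENTVALIDATION"] = Hidden("__EVENTVALIDATION"),
            ["__EVENTTARGET"] = "",
            ["__EVENTARGUMENT"] = "",
            ["__LASTFOCUS"] = "",
            ["email_textbox"] = user,
            ["pwd_textbox"] = password,
            ["remember"] = "on",
            ["go_button.x"] = "36",
            ["go_button.y"] = "16",
        };

        // FormUrlEncodedContent posts with Content-Type: application/x-www-form-urlencoded.
        var response = await client.PostAsync(loginUrl, new FormUrlEncodedContent(form));
        response.EnsureSuccessStatusCode();

        // The CookieContainer now holds the session cookies; reuse this client for further pages.
        return client;
    }
}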
The mental process is to simulate a person logging in to the website. Some logins are made using AJAX, others with a traditional POST request, so the first thing you need to do is make that request the way a browser does. In the server response you will get cookies, headers, and other information; you need to use that info to build the next request. These are the scraper's requests.
The steps are:
1) Build a request, like the browser does, to authenticate yourself to the app.
2) Inspect the response, and save the headers, cookies, or other useful info needed to persist your session with the server.
3) Make another request to the server, using the info you gathered in the second step.
4) Inspect the response, and use a data-analysis algorithm or something else to extract the data.
Tips:
You are not using a JavaScript engine here; some websites use JavaScript to show graphs or perform some interaction on the DOM. In those cases, you may need to use a WebKit library wrapper.

Console app to login to ASP.NET website

First, please excuse my naivety with this subject. I'm a retired programmer who started before DOS was around. I'm not an expert on ASP.NET. Part of what I need to know is what I need to know. (If you follow me...)
So I want to log into a web site and scrape some content. After looking at the HTML source with notepad and fiddler2, it's clear to me that the site is implemented with ASP.NET technologies.
I started by doing a lot of google'ing and reading everything I could find about writing screen scrapers in c#. After some investigation and many attempts, I think I've come to the conclusion that it isn't easy.
The crux of the problem (as I see it now) is that ASP provides lots of ways for a programmer to maintain state. Cookies, viewstate, session vars, page vars, get and post params, etc. Plus the programmer can divide the work up between server and client scripting. A rich web client such as IE or Safari or Chrome or Firefox knows how to handle whatever the programmer writes (and the ASP framework implements under the covers).
WebClient isn't a rich web client. It doesn't even know how to implement cookies.
So I'm at an impasse. One way to go is to try to reverse engineer all the features of the rich client that the ASP application is expecting and write a WebClient-on-steroids class that mimics a rich client well enough to get logged in.
Or I could try embedding IE (or some other rich client) into my app and hope the exposed interface is rich enough that I can programmatically fill a username and password field and POST the form back. (And access the response stream so I can parse the HTML to scrape out the data I'm after...)
Or I could look for some third-party control that would be a lot richer than WebClient.
Can anyone shed some keen insight into where I should focus my attention?
This is as much a learning experience as a project. That said, I really want to automate login and information retrieval from the target site.
Here is an example function I use to log in to a website and get my cookie:
string loginSite(string url, string username, string password)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    string cookie = "";

    // These form field names will change depending on the website.
    // (Ideally, URL-encode the username and password before building the string.)
    string values = "vb_login_username=" + username + "&vb_login_password=" + password
        + "&securitytoken=guest&"
        + "cookieuser=checked&"
        + "do=login";

    req.Method = "POST";
    req.ContentType = "application/x-www-form-urlencoded";
    req.ContentLength = values.Length;

    CookieContainer a = new CookieContainer();
    req.CookieContainer = a;
    System.Net.ServicePointManager.Expect100Continue = false; // prevents 417 error

    using (StreamWriter writer = new StreamWriter(req.GetRequestStream(), System.Text.Encoding.ASCII))
    {
        writer.Write(values);
    }

    HttpWebResponse c = (HttpWebResponse)req.GetResponse();
    Stream responseStream = c.GetResponseStream();
    StreamReader reader = new StreamReader(responseStream);
    string source = reader.ReadToEnd();

    // Collect the cookies the server set so they can be replayed on later requests.
    foreach (Cookie cook in c.Cookies)
    {
        cookie = cookie + cook.ToString() + ";";
    }
    return cookie;
}
And here a call example:
string Cookie = loginSite("http://theurl.com/login.php?s=c29cea718f052eae2c6ed105df2b7172&do=login", "user", "passwd");
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create("http://www.theurl.com");
// Once you have the cookie, you add it to the header.
req.Headers.Add("cookie", Cookie);
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
using (Stream respStream = response.GetResponseStream())
{
    using (StreamReader sr = new StreamReader(respStream))
    {
        string s = sr.ReadToEnd();
        HtmlReturn = s; // HtmlReturn is assumed to be a field holding the page HTML
        // System.Diagnostics.Debugger.Break();
    }
}
With Firefox you could use the Live HTTP Headers extension to see which parameters are being set by the POST, and then modify the values variable:
string values = "vb_login_username=" + username + "&vb_login_password=" + password
+ "&securitytoken=guest&"
+ "cookieuser=checked&"
+ "do=login";
to match the parameters expected by the destination website.
If you decide to use Live HTTP Headers for Firefox, when you log into the website you will get the POST information from the headers, something like this:
GET / HTTP/1.1
Host: www.microsoft.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: es-es,es;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: WT_FPC=id=82.144.112.152-154450144.30258861:lv=1351580394112:ss=1351575867559; WT_NVR_RU=0=msdn:1=:2=; omniID=0d2276c2_bbdd_4386_a11d_f8da1dbc5489; MUID=349E06C547426937362B02CC434269B9; MC1=GUID=47b2ed8aeea0de4797d3a40cf549dcbb&HASH=8aed&LV=201210&V=4&LU=1351608258765; A=I&I=AxUFAAAAAAALBwAAukh4HjpMmS4eKtKpWV0ljg!!&V=4; msdn=L=en-US
I suspect you may be able to build a Chrome extension that could do this for you.
By the way, you're not a "security expert" are you?
Why don't you use IE? Automating IE in Windows Forms is very simple, plus you can easily handle a proxy as well.

you must use a browser that supports and has JavaScript enabled

I am trying to post using HttpWebRequest, and this is the response I keep getting back:
you must use a browser that supports and has JavaScript enabled
This is my post code:
HttpWebRequest myRequest = null;
myRequest = (HttpWebRequest)HttpWebRequest.Create(submitURL);
myRequest.Headers.Add("Accept-Language", "en-US");
myRequest.Accept = "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/xaml+xml, application/vnd.ms-xpsdocument, application/x-ms-xbap, application/x-ms-application, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
myRequest.Method = WebRequestMethods.Http.Post;
myRequest.Headers.Add("Accept-Language", "en-US");
myRequest.Accept = "*/*, text/xml";
myRequest.ContentType = "application/x-www-form-urlencoded" + "\n" + "\r";
myRequest.CookieContainer = cookieContainer;
myRequest.Headers.Add("UA-CPU", "x86");
myRequest.Headers.Add("Accept-Encoding", "gzip, deflate");
//cPostData section removed as submitting to SO
myRequest.ContentLength = cPostData.Length;
myRequest.ServicePoint.Expect100Continue = false;
StreamWriter streamWriter = new System.IO.StreamWriter(myRequest.GetRequestStream());
streamWriter.Write(cPostData);
streamWriter.Close();
HttpWebResponse httpWebResponse = (HttpWebResponse)myRequest.GetResponse();
StreamReader streamReader = new System.IO.StreamReader(httpWebResponse.GetResponseStream());
string stringResult = streamReader.ReadToEnd();
streamReader.Close();
How do I avoid getting this error?
It is difficult to say what the exact problem is, because the server receiving your request doesn't think it is valid.
Perhaps the first thing to try would be to set the UserAgent property on your HttpWebRequest to some valid browser's user-agent string, as the server may be using this value to decide whether or not to serve the page.
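For example, reusing the myRequest object from the question (the exact user-agent string below is only an illustration):
// Present a browser-like user agent; some servers refuse requests without one.
myRequest.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36";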
This doesn't have anything to do with your code - the web server code has something that detects or relies on Javascript. Most likely a piece of Javascript on the page fills out (or modifies prior to posting) some hidden form field(s).
The solution to this is entirely dependent on what the web server is expecting to happen with that form data.
This is a layman's answer, not a 100% technically accurate description of the HttpWebRequest object, and it is meant that way because of the amount of time a full one would take to post. The first part of this answer is there to clarify the final sentence.
The HttpWebRequest object basically acts as a browser interacting with web pages. It's a very simple browser, with no UI; it's designed basically to be able to post to and read from web pages. As such, it does not support a variety of features normally found in a browser these days, such as JavaScript.
The page you are attempting to post to requires JavaScript, which the HttpWebRequest object does not support. If you have no control over the page the WebRequest object is posting to, then you'll have to find another way to post to it. If you own or control the page, you will need to modify it to strip out the items that require JavaScript (such as Ajax features, etc.).
Added
I purposely didn't add anything about specifying a user-agent to try to trick the web server into thinking the HttpWebRequest object supports JavaScript, because it is likely the page really does need JavaScript enabled in order to display properly. However, a lot of my assumptions prove wrong, so I would agree with @Andrew Hare and say it's worth a try.
