How to parse the full webpage text from this page? - c#

I need to get the track names from this page, but I get an incomplete response:
var response = await client.GetStringAsync(new Uri("http://parmismedia1.com/musicplayeralbum.aspx?album=666&id=8503&title=farzad-farzin-6-to-che-bashi"));
I used the Firefox inspector, sent a POST request, and tried both mobile and desktop user-agent strings, but I still got an incomplete response.
However, I noticed that I get the full page text when I create a download task for that address in UC Browser.
How can i get the complete page text?

In a test app, I used a valid URL (with a literal ampersand & in the query string rather than the HTML-encoded &amp;), and the response returns correctly:
var client = new HttpClient();
var response = await client.GetStringAsync(new Uri("https://parmismedia1.com/musicplayeralbum.aspx?album=666&id=8503&title=farzad-farzin-6-to-che-bashi"));
That being said, your original query also returns successfully, it's just doing several redirects before it fully returns.
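If you want to see those redirects for yourself, a minimal sketch (assuming you only need HttpClient and HttpClientHandler, and that this runs inside an async method) is to disable automatic redirects and follow each Location header by hand:
// Sketch: turn off automatic redirects so each hop becomes visible.
var handler = new HttpClientHandler { AllowAutoRedirect = false };
var client = new HttpClient(handler);
var uri = new Uri("http://parmismedia1.com/musicplayeralbum.aspx?album=666&id=8503&title=farzad-farzin-6-to-che-bashi");
for (int hop = 0; hop < 10; hop++) // arbitrary safety limit
{
    var response = await client.GetAsync(uri);
    int status = (int)response.StatusCode;
    if (status < 300 || status >= 400)
        break; // final page reached; read response.Content here
    // Location can be relative, so resolve it against the current address.
    uri = new Uri(uri, response.Headers.Location);
    Console.WriteLine("Redirected to: " + uri);
}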
But, I did notice that the HTML page being returned isn't entirely valid as it contains error information at the beginning of the response:
The process cannot access the file 'C:\inetpub\PMWebsite\Log\500_2016-04-03.log' because it is being used by another process.
<!DOCTYPE html>
<html lang="en" class="app">
<head><meta charset="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1" /><title>
You might want to check with the creators of the website whether scraping their web application content is OK, and whether they have a direct API you could use instead.

I'm still figuring out why client.GetStringAsync is not working, but I was able to get the page HTML by using System.Net.HttpWebRequest.
Code sample below.
Uri address = new Uri("http://parmismedia1.com/musicplayeralbum.aspx?album=666&id=8503&title=farzad-farzin-6-to-che-bashi");
HttpWebRequest httpRequest = WebRequest.Create(address) as HttpWebRequest;
httpRequest.UseDefaultCredentials = true;
httpRequest.ServicePoint.Expect100Continue = false;
httpRequest.Proxy.Credentials = CredentialCache.DefaultCredentials;
httpRequest.ProtocolVersion = HttpVersion.Version11;
httpRequest.UserAgent = @"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0";
httpRequest.Method = "GET";
httpRequest.Timeout = 3000;
HttpWebResponse response = httpRequest.GetResponse() as HttpWebResponse;
StreamReader reader = new StreamReader(response.GetResponseStream());
string html = reader.ReadToEnd();
response.Close();
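Once you have the html string, a small sketch of extracting the track names with HtmlAgilityPack could look like the following; the XPath is only a guess, so inspect the real markup and adjust it:
// Sketch: parse the downloaded html and list candidate track-name elements.
// The XPath below is hypothetical; check the actual page structure first.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var trackNodes = doc.DocumentNode.SelectNodes("//a");
if (trackNodes != null)
{
    foreach (var node in trackNodes)
    {
        Console.WriteLine(node.InnerText.Trim());
    }
}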

Related

How to get full webpage HTML in c#?

I am attempting to make a web scraper to collect news articles; however, I am having trouble obtaining the full HTML content of the webpage. Here is the URL that I initially need to scrape for article search results:
Then, I scrape each individual article (example).
I have tried using WebRequest, HttpWebRequest, and WebClient to make my requests, however the result that is returned each time only contains the HTML content for the sidebar, etc. I have used the Chrome developer tools, and the returned HTML begins just after the main content of the page, and is therefore unhelpful. I have also looked for AJAX calls for the content and have not been able to find any.
I have successfully been able to scrape the needed content using Selenium Webdriver, however this is not ideal as it is much slower to visit every url, and it often gets hung up loading pages. Any help with requesting the full html contents of the page would be greatly appreciated.
I am not sure what issue you're having, but here's how I accomplished your task.
First I viewed the page in my web browser with the network tab open in developer tools.
From here I collected the list of headers my real browser sent. I then constructed an HttpWebRequest with those same headers and was able to retrieve the full HTML of the page.
public string getHtml()
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.fa-mag.com/search.php?and_or=and&date_range=all&magazine=&sort=newest&method=basic&query=ubs");
    req.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0";
    req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    req.AllowAutoRedirect = false;
    req.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.5");
    req.Headers.Add("cookie", "analytics_id=595127c20cdfe6.52043028595127c20ce022.71834842; PHPSESSID=tbbo7npldsv26n2q7pg2728k77; D_IID=3E4FEA7F-9794-34EE-99F8-87EEA3DF0689; D_UID=5F374D94-270D-3653-8C54-9A46F381EAE2; D_ZID=505BB8EF-5A2D-3CBD-87D8-FABAD5014776; D_ZUID=BB0C9EF2-0E7B-383E-A03A-A3E92CC8051A; D_HID=9642D775-D860-3F04-8720-73E5339042BA; D_SID=63.138.127.22:6Ci6jv2Xv+yum3m9lNfnyRcAylne67YfnS/u8goKrxQ");
    req.Headers.Add("DNT", "1");
    req.Headers.Add("Upgrade-Insecure-Requests", "1");
    HttpWebResponse res = null;
    try
    {
        res = (HttpWebResponse)req.GetResponse();
    }
    catch (WebException webex)
    {
        res = (HttpWebResponse)webex.Response;
    }
    string html = new StreamReader(res.GetResponseStream()).ReadToEnd();
    return html;
}
Without the custom headers, bot protection on the page sends a 416 response and redirects. If you read the HTML of the redirect page, it states that the site has detected you as a bot.
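Note that the hard-coded cookie string will stop working once that session expires. One possible variation, just a sketch, is to let a shared CookieContainer pick up whatever cookies the site sets on a first request and resend them with the search request; whether the bot protection accepts cookies obtained this way is something you would have to test:
// Sketch: share one CookieContainer so cookies from the first response
// are sent automatically with the second request.
CookieContainer cookieJar = new CookieContainer();
string userAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0";

HttpWebRequest warmUp = (HttpWebRequest)WebRequest.Create("http://www.fa-mag.com/");
warmUp.CookieContainer = cookieJar;
warmUp.UserAgent = userAgent;
using (warmUp.GetResponse()) { } // just collect cookies

HttpWebRequest search = (HttpWebRequest)WebRequest.Create("http://www.fa-mag.com/search.php?and_or=and&date_range=all&magazine=&sort=newest&method=basic&query=ubs");
search.CookieContainer = cookieJar; // cookies from the warm-up request travel along
search.UserAgent = userAgent;
string searchHtml;
using (var res = (HttpWebResponse)search.GetResponse())
using (var reader = new StreamReader(res.GetResponseStream()))
{
    searchHtml = reader.ReadToEnd();
}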

Using C# HttpClient to log in to a website and scrape information from another page

I am trying to use C# and Chrome Web Inspector to login on http://www.morningstar.com and retrieve some information on the page http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US.
I do not quite understand the mental process one must use to interpret the information from the Web Inspector in order to simulate a login, keep the session, and navigate to the next page to collect information.
Can someone explain or point me to a resource?
For now, I have only some code to get the content of the home page and the login page:
public class Morningstar
{
    public async static void Ru4n()
    {
        var url = "http://www.morningstar.com/";
        var httpClient = new HttpClient();
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
        httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");
        var response = await httpClient.GetAsync(new Uri(url));
        response.EnsureSuccessStatusCode();
        using (var responseStream = await response.Content.ReadAsStreamAsync())
        using (var decompressedStream = new GZipStream(responseStream, CompressionMode.Decompress))
        using (var streamReader = new StreamReader(decompressedStream))
        {
            //Console.WriteLine(streamReader.ReadToEnd());
        }
        var loginURL = "https://members.morningstar.com/memberservice/login.aspx";
        response = await httpClient.GetAsync(new Uri(loginURL));
        response.EnsureSuccessStatusCode();
        using (var responseStream = await response.Content.ReadAsStreamAsync())
        using (var streamReader = new StreamReader(responseStream))
        {
            Console.WriteLine(streamReader.ReadToEnd());
        }
    }
}
EDIT: In the end, on the advice of Muhammed, I used the following piece of code:
ScrapingBrowser browser = new ScrapingBrowser();
//set UseDefaultCookiesParser as false if a website returns invalid cookies format
//browser.UseDefaultCookiesParser = false;
WebPage homePage = browser.NavigateToPage(new Uri("https://members.morningstar.com/memberservice/login.aspx"));
PageWebForm form = homePage.FindFormById("memberLoginForm");
form["email_textbox"] = "example#example.com";
form["pwd_textbox"] = "password";
form["go_button.x"] = "57";
form["go_button.y"] = "22";
form.Method = HttpVerb.Post;
WebPage resultsPage = form.Submit();
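Assuming the login succeeds, the same ScrapingBrowser instance keeps the session cookies, so fetching the data page afterwards is just one more navigation (a sketch; how you read the resulting HTML may vary slightly between ScrapySharp versions):
// Sketch: reuse the logged-in browser for the page that actually has the data.
WebPage dataPage = browser.NavigateToPage(new Uri("http://financials.morningstar.com/income-statement/is.html?t=BTDPF&region=usa&culture=en-US"));
// dataPage now holds the fetched document, ready to be parsed for the figures you need.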
You should simulate the login process of the website. The easiest way to do this is to inspect the traffic with a debugging proxy (for example, Fiddler).
Here is the login request of the web site:
POST https://members.morningstar.com/memberservice/login.aspx?CustId=&CType=&CName=&RememberMe=true&CookieTime= HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Referer: https://members.morningstar.com/memberservice/login.aspx
** omitted **
Cookie: cookies=true; TestCookieExist=Exist; fp=001140581745182496; __utma=172984700.91600904.1405817457.1405817457.1405817457.1; __utmb=172984700.8.10.1405817457; __utmz=172984700.1405817457.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=172984700; ASP.NET_SessionId=b5bpepm3pftgoz55to3ql4me
email_textbox=test@email.com&pwd_textbox=password&remember=on&email_textbox2=&go_button.x=36&go_button.y=16&__LASTFOCUS=&__EVENTTARGET=&__EVENTARGUMENT=&__VIEWSTATE=omitted&__EVENTVALIDATION=omitted
When you inspect this, you'll see some cookies and form fields like "__VIEWSTATE". You'll need the actual values of these fields to log in. You can use the following steps:
Make a request and scrape fields like "__LASTFOCUS", "__EVENTTARGET", "__EVENTARGUMENT", "__VIEWSTATE", "__EVENTVALIDATION", and cookies.
Create a new POST request to the same page, reusing the CookieContainer from the previous one; build a POST string from the scraped fields, the username, and the password, and post it with the MIME type application/x-www-form-urlencoded.
If successful, use the cookies for further requests to stay logged in.
Note: You can use HtmlAgilityPack or ScrapySharp to scrape the HTML. ScrapySharp provides easy-to-use tools for posting forms and browsing websites.
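If you prefer to do the posting by hand rather than through a library, a bare-bones sketch of the steps above could look like the following; the form field names come from the captured request, and the credentials and error handling are placeholders only:
// Sketch: GET the login page, scrape the hidden ASP.NET fields, then POST them
// back with the credentials, keeping all cookies in one container.
var cookies = new CookieContainer();
string loginUrl = "https://members.morningstar.com/memberservice/login.aspx";

var getRequest = (HttpWebRequest)WebRequest.Create(loginUrl);
getRequest.CookieContainer = cookies;
string loginHtml;
using (var getResponse = (HttpWebResponse)getRequest.GetResponse())
using (var reader = new StreamReader(getResponse.GetResponseStream()))
{
    loginHtml = reader.ReadToEnd();
}

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(loginHtml);
string viewState = doc.GetElementbyId("__VIEWSTATE").GetAttributeValue("value", "");
string eventValidation = doc.GetElementbyId("__EVENTVALIDATION").GetAttributeValue("value", "");

string body =
    "email_textbox=" + WebUtility.UrlEncode("test@email.com") +
    "&pwd_textbox=" + WebUtility.UrlEncode("password") +
    "&go_button.x=36&go_button.y=16" +
    "&__VIEWSTATE=" + WebUtility.UrlEncode(viewState) +
    "&__EVENTVALIDATION=" + WebUtility.UrlEncode(eventValidation);

var postRequest = (HttpWebRequest)WebRequest.Create(loginUrl);
postRequest.CookieContainer = cookies; // same container keeps the session
postRequest.Method = "POST";
postRequest.ContentType = "application/x-www-form-urlencoded";
using (var writer = new StreamWriter(postRequest.GetRequestStream()))
{
    writer.Write(body);
}
using (postRequest.GetResponse())
{
    // If the login succeeded, "cookies" now carries the authenticated session
    // and can be attached to further requests.
}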
The mental process is to simulate a person logging in to the website. Some logins are made using AJAX and others with a traditional POST request, so the first thing you need to do is make that request the same way the browser does. In the server response you will get cookies, headers, and other information; you need to use that info to build the next request. These are the scraping requests.
Steps are:
1) Build a request, like the browser does, to authenticate yourself to the app.
2) Inspect the response, and save the headers, cookies, or other useful info you need to persist your session with the server.
3) Make another request to the server, using the info you gathered in the second step.
4) Inspect the response, and use an HTML parser or whatever extraction logic you need to pull out the data.
Tip:
This approach does not run a JavaScript engine. Some websites use JavaScript to draw graphs or perform some interaction with the DOM; in those cases you may need to use a WebKit library wrapper.

WebRequest.GetResponse() returns 404 on valid URL

I'm trying to scrape a web page via a C# application, but it keeps responding with
"The remote server returned an error: (404) Not Found."
The web page is accessible through a browser, but the app keeps failing. Any help is appreciated.
var d = DateTime.UtcNow.Date;
var AddressString = @"http://www.booking.com/searchresults.html?src=searchresults&si=ai%2Cco%2Cci%2Cre%2Cdi&ss={0}&checkin_monthday={1}&checkin_year_month={2}&checkout_monthday={3}&checkout_year_month={4}";
var URi = String.Format(AddressString, "Prague", d.Day, d.Year + "-" + d.Month, d.Day + 1, d.Year + "-" + d.Month);
var request = (HttpWebRequest)WebRequest.Create(URi);
request.Timeout = 5000;
request.UserAgent = "Fiddler"; // I tried to set the next three properties so they are not null
request.Credentials = CredentialCache.DefaultCredentials;
request.Proxy = WebProxy.GetDefaultProxy();
try
{
    var response = (HttpWebResponse)request.GetResponse();
}
catch (WebException e)
{
    var response = (HttpWebResponse)e.Response; // e.Response contains the web page, but it is incomplete
    StreamReader sr = new StreamReader(response.GetResponseStream());
    HtmlDocument doc = new HtmlDocument();
    doc.Load(sr);
    var a = doc.DocumentNode.SelectNodes("div[@class='resut-details']"); // fails, as not all desired nodes are in the response
}
EDIT:
Hi guys, thanks for the suggestions.
I added the header "Accept-Encoding: gzip,deflate,sdch" according to David Martin's reply, but it didn't help on its own.
I used Fiddler to try to get any info about the problem, but I was seeing that app for the first time and it didn't make me any smarter. On the other hand, I tried changing request.UserAgent to the one sent by my browser ("User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36") and voilà, I am not getting the 404 exception anymore, but the document is not readable, as it is filled with characters such as: ¿½O~���G�. I tried setting request.TransferEncoding = "UTF-8", but to enable this property, request.SendChunked must be set to true, which ends in a
ProtocolViolationException
Additional information: Content-Length or Chunked Encoding cannot be set for an operation that does not write data.
EDIT 2:
I'm forgetting something and I can't figure out what. I'm getting a response that is encoded somehow and need to decode it first to read it correctly. Even in Fiddler, when I want to see the response, I need to confirm decoding to inspect the result. After I decode it in Fiddler, I'm getting just what I want to get into my application...
So, after trying the suggestions from Jon Skeet and David Martin I got somewhat further and found the relevant answer in another topic. If anyone is ever looking for something similar, the answer is here:
.NET: Is it possible to get HttpWebRequest to automatically decompress gzip'd responses?
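In short, the unreadable characters are the still-compressed response body. A minimal sketch of the fix, using the built-in AutomaticDecompression support of HttpWebRequest, is:
// Sketch: let HttpWebRequest decompress gzip/deflate responses automatically.
var request = (HttpWebRequest)WebRequest.Create(URi);
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36";
request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd(); // readable text instead of compressed bytes
}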

get page source code after redirection

I tried some ways to get the page source code of the following website: http://www.poppe-bedrijfswagens.nl. This website has an automatic redirection set up, I think.
I tried the following ways:
WebClient client = new WebClient();
string sourceCode = "";
sourceCode = client.DownloadString(address);
And
HttpWebRequest myWebRequest = (HttpWebRequest)HttpWebRequest.Create(address);
myWebRequest.AllowAutoRedirect = true;
myWebRequest.Method = "GET";
// make request for web page
HttpWebResponse myWebResponse = (HttpWebResponse)myWebRequest.GetResponse();
StreamReader myWebSource = new StreamReader(myWebResponse.GetResponseStream());
string myPageSource = myWebSource.ReadToEnd();
myWebResponse.Close();
I always get the source code of the first page, but I need to get the source code of the page that the website redirects to.
The redirection for http://www.poppe-bedrijfswagens.nl is:
Type of redirect: “meta refresh” redirect after 0 seconds
Redirected to: http://www.poppe-bedrijfswagens.nl/daf-html/dealer_homepage.html
Thanks in advance
The AllowAutoRedirect property only applies when the redirection is done with an HTTP status code such as 302. A meta refresh isn't technically an HTTP redirection: the first page really does load, and the browser then navigates to the new URL on its own.
You can, however, download the first page, search its DOM for the element you're interested in, <meta http-equiv="refresh" content="0;url=HTTP://WWW.NEXT-URL.COM">, and then download the page it points to.
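A rough sketch of that approach follows; the regex is intentionally simple and assumes the content attribute has the form 0;url=...:
// Sketch: download the first page, extract the meta-refresh target, then
// download the page it points to.
WebClient client = new WebClient();
Uri startAddress = new Uri("http://www.poppe-bedrijfswagens.nl");
string firstPage = client.DownloadString(startAddress);

Match match = Regex.Match(
    firstPage,
    @"<meta\s+http-equiv=""refresh""[^>]*content=""\d+\s*;\s*url=([^""]+)""",
    RegexOptions.IgnoreCase);

if (match.Success)
{
    // Resolve a relative target against the original address.
    Uri target = new Uri(startAddress, match.Groups[1].Value);
    string finalPage = client.DownloadString(target);
}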

suppress the save dialog in the web browser and automate the download

I want to automate, from the client side, the download of an exe that a link prompts for. I can get the first redirected link, from http://go.microsoft.com/fwlink/?LinkID=149156 to http://www.microsoft.com/getsilverlight/handlers/getsilverlight.ashx. Please click and check how it works: fwlink -> .ashx -> .exe. I want to get the direct link to the .exe.
But the response returns 404 when requesting the web handler through code, whereas in a browser it actually downloads.
Can anyone suggest how to automate the download from the above link? The code I am using to get the redirected link is this one.
public static string GetLink(string url)
{
    HttpWebRequest httpWebRequest = WebRequest.Create(url) as HttpWebRequest;
    httpWebRequest.Method = "HEAD";
    httpWebRequest.AllowAutoRedirect = false;
    // httpWebRequest.ContentType = "application/octet-stream";
    // httpWebRequest.Headers.Add("content-disposition", "attachment; filename=Silverlight.exe");
    HttpWebResponse httpWebResponse = httpWebRequest.GetResponse() as HttpWebResponse;
    if (httpWebResponse.StatusCode == HttpStatusCode.Redirect)
    {
        return httpWebResponse.GetResponseHeader("Location");
    }
    else
    {
        return null;
    }
}
Just tested this out and it will download the file.
WebClient client = new WebClient();
client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
client.DownloadFile(url, "Filename.exe");
You just needed to add the user-agent, as the particular Silverlight download served depends on what browser you are running; if the server can't detect one, the request will fail.
Change the user-agent to something that will trigger the appropriate download you want.
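If you still want the direct link to the .exe rather than the file itself, a sketch of extending the GetLink idea to walk the whole redirect chain (with the user-agent set, for the reason described above) might look like this:
public static string GetFinalLink(string url)
{
    // Sketch: keep following Location headers until the response is no longer a redirect.
    string current = url;
    for (int hop = 0; hop < 10; hop++) // arbitrary limit to avoid redirect loops
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(current);
        request.Method = "HEAD";
        request.AllowAutoRedirect = false;
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            int status = (int)response.StatusCode;
            if (status < 300 || status >= 400)
                return current; // no more redirects: this should be the direct link
            // Location can be relative, so resolve it against the current address.
            current = new Uri(new Uri(current), response.GetResponseHeader("Location")).AbsoluteUri;
        }
    }
    return current;
}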
