I'm writing an interface to scrape info from a service. The link is behind a login, so I keep a copy of the cookies and then attempt to loop through the pages to get stats for our users.
The urls to hit are of the format: https://domain.com/groups/members/1234
for the first page, and each subsequent page appends ?page=X
string vUrl = "https://domain.com/groups/members/1234";
if (pageNumber > 1) vUrl += "?page=" + (pageNumber).ToString();
HttpWebRequest groupsRequest = (HttpWebRequest)WebRequest.Create(vUrl);
groupsRequest.CookieContainer = new CookieContainer();
groupsRequest.CookieContainer.Add(cookies); // reuse the cookies saved from the login request
groupsRequest.Method = WebRequestMethods.Http.Get;
groupsRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36";
groupsRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
groupsRequest.UseDefaultCredentials = true;
groupsRequest.AutomaticDecompression = DecompressionMethods.GZip;
groupsRequest.Headers.Add("Accept-Language", "en-US,en;q=0.8");
groupsRequest.Headers.Add("Cache-Control", "max-age=0");
HttpWebResponse getResponse = (HttpWebResponse)groupsRequest.GetResponse();
This works fine for the first page and I get the data back that I need, but on each subsequent pass the query string is ignored. Debugging at the last line shows that the request's RequestUri.Query is correct, but the response's ResponseUri.Query is blank. So it has the effect of always returning page 1 data.
I've tried to mimic the request headers that I see via Inspect in Chrome, but I'm stuck. Help?
When you put the url that is failing into a browser, does it work? Because it is a GET, the browser should make the same request and tell you whether it is working. If it does not work in the browser, then perhaps you are missing something other than the query string.
If it does work, then use Fiddler to find out exactly what headers, cookies, and query string values are being sent, to make 100% sure that you are sending the correct request. It could be that the query string alone is not enough information to get the data that you need.
If you still can't get it, capture the request in Fiddler when you send it through the browser, then use this Fiddler extension to turn the request into code and see what's up.
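If the server is redirecting you back to page 1, that would also explain why the request's query is correct while the response's ResponseUri has none. A quick way to check (just a sketch; cookies is the CookieCollection you saved from the login, as in your code) is to turn off redirect following and look at the raw response:
HttpWebRequest probe = (HttpWebRequest)WebRequest.Create("https://domain.com/groups/members/1234?page=2");
probe.CookieContainer = new CookieContainer();
probe.CookieContainer.Add(cookies);
probe.AllowAutoRedirect = false; // stop at the first response instead of following it
using (HttpWebResponse probeResponse = (HttpWebResponse)probe.GetResponse())
{
    // A 301/302 here with a Location stripped of ?page=2 confirms the theory
    Console.WriteLine(probeResponse.StatusCode);
    Console.WriteLine(probeResponse.Headers["Location"]);
}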
I need to scrape a table of info from a site for which I have valid credentials because the owners of the site do not provide an API.
I performed a login and saved the traffic with Fiddler, and am trying to replicate the key steps.
I'm going to show the steps I've done so far, and get to where I am stuck.
Log into the base url
CookieContainer jar = new CookieContainer();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlBase);
request.CookieContainer = jar;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string newUrl = response.ResponseUri.ToString();
Along with the return a cookie is set. When I look at the CookieContainer it has a count of 1 after the call.
Interestingly, the response object does not contain the cookie, but I think all is okay because I can use jar.
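To sanity-check what the jar holds at each step, a quick dump works (just a sketch; urlBase is the same string used above):
foreach (Cookie c in jar.GetCookies(new Uri(urlBase)))
{
    Console.WriteLine("{0} = {1}", c.Name, c.Value); // name/value of each stored cookie
}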
2nd call
I'm not yet at the page where the name and password are presented, that doesn't happen until the 4th call.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlBase + secondCallFolderAddition);
CookieCollection bakery = new CookieCollection();
request.KeepAlive = true;
request.Headers.Add("Upgrade-Insecure-Requests", #"1");
//request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 OPR/46.0.2597.57";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8";
request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip, deflate, br");
request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string newURL = response.ResponseUri.ToString();
I get an OK status, and the response looks good compared to the original Fiddler traffic capture. In the original this 2nd call does not set a cookie, and no cookie is set here.
Third call
But here's where I get lost: the browser sent cookie data with three values (I've obfuscated them):
__utma=1.123456789.123456789.123456789.123456789.1
olfsk=olfsk12345678901234567890123456789
hblid=abCDl11ABCabXabc1aABv1FLFX1RE1OS
I don't know where those values get set. They seem to relate to Google Analytics (from articles I've found) but I don't know how to collect them so that I can attach them to the call I make.
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(newUrl);
request.KeepAlive = true;
request.Headers.Add("Upgrade-Insecure-Requests", "1");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36 OPR/46.0.2597.57";
request.Accept = "text/html,application/xhtml+xml,application/xml;
q=0.9,image/webp,image/apng,*/*;q=0.8";
request.Headers.Set(HttpRequestHeader.AcceptEncoding, "gzip, deflate, br");
request.Headers.Set(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
// request.Headers.Set(HttpRequestHeader.Cookie,
//     "__utma=1.123456789.123456789.123456789.123456789.1; olfsk=olfsk12345678901234567890123456789; hblid=abCDl11ABCabXabc1aABv1FLFX1RE1OS");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
string newURL = response.ResponseUri.ToString();
Please note the commented-out lines with the cookie data; I've tried this with those lines un-commented as well.
What happens is that I never get a response to the call.
I am very appreciative of any insights.
I am guessing that the cookie data in the third call is needed, and that it is set by a client-side script that runs between the 2nd and 3rd call, but I am new to this and unsure.
Also, if it is set on the client side, how can I get valid cookies that will get me past this roadblock? (There is another roadblock coming in the next call, where more cookies are used that I do not see set in a server response, but I am not there yet.)
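One idea I plan to try (just a sketch; it assumes the values captured in Fiddler remain valid and that the same jar from the first call is attached to each request via request.CookieContainer = jar): seed the jar with the captured values so they ride along automatically:
Uri siteUri = new Uri(urlBase);
// Seed the client-side (analytics/chat) cookies captured in Fiddler into the jar
jar.Add(siteUri, new Cookie("__utma", "1.123456789.123456789.123456789.123456789.1"));
jar.Add(siteUri, new Cookie("olfsk", "olfsk12345678901234567890123456789"));
jar.Add(siteUri, new Cookie("hblid", "abCDl11ABCabXabc1aABv1FLFX1RE1OS"));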
I know I can solve this by using a WebBrowser object, but that seems like a clumsy solution. Is there a less clumsy way to go? Are there other objects or libraries I should try? (RestSharp? Postman? WebRequest instead of HttpWebRequest?)
I am attempting to make a web scraper to collect news articles, however I am having trouble obtaining the full html content of the webpage. Here is the url that I initially need to scrape for article search results:
http://www.fa-mag.com/search.php?and_or=and&date_range=all&magazine=&sort=newest&method=basic&query=ubs
Then, I scrape each individual article (example).
I have tried using WebRequest, HttpWebRequest, and WebClient to make my requests, however the result that is returned each time only contains the html for the sidebar, etc. Comparing against Chrome developer tools, the returned html picks up just after where the main content of the page should be, and therefore is unhelpful. I have also looked for ajax calls that might load the content and have not been able to find any.
I have successfully been able to scrape the needed content using Selenium Webdriver, however this is not ideal as it is much slower to visit every url, and it often gets hung up loading pages. Any help with requesting the full html contents of the page would be greatly appreciated.
I am not sure what issue you are having, but here's how I accomplished your task.
First I viewed the page in my web browser with the network tab open in developer tools.
From here I collected a list of the headers my real browser sent. I then constructed an HttpWebRequest with those headers appended and was able to retrieve the full html of the page.
public string getHtml()
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://www.fa-mag.com/search.php?and_or=and&date_range=all&magazine=&sort=newest&method=basic&query=ubs");

    // Headers copied from a real browser session. The cookie values are from
    // my own session and will expire; capture fresh ones from your browser.
    req.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0";
    req.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    req.AllowAutoRedirect = false; // don't follow the bot-protection redirect
    req.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.5");
    req.Headers.Add("cookie", "analytics_id=595127c20cdfe6.52043028595127c20ce022.71834842; PHPSESSID=tbbo7npldsv26n2q7pg2728k77; D_IID=3E4FEA7F-9794-34EE-99F8-87EEA3DF0689; D_UID=5F374D94-270D-3653-8C54-9A46F381EAE2; D_ZID=505BB8EF-5A2D-3CBD-87D8-FABAD5014776; D_ZUID=BB0C9EF2-0E7B-383E-A03A-A3E92CC8051A; D_HID=9642D775-D860-3F04-8720-73E5339042BA; D_SID=63.138.127.22:6Ci6jv2Xv+yum3m9lNfnyRcAylne67YfnS/u8goKrxQ");
    req.Headers.Add("DNT", "1");
    req.Headers.Add("Upgrade-Insecure-Requests", "1");

    HttpWebResponse res = null;
    try
    {
        res = (HttpWebResponse)req.GetResponse();
    }
    catch (WebException webex)
    {
        // non-success status codes throw, but the body is still readable
        res = (HttpWebResponse)webex.Response;
    }

    string html = new StreamReader(res.GetResponseStream()).ReadToEnd();
    return html;
}
Without the custom headers, the bot protection on the page sends a 416 response and does a redirect. If you read the html of the redirect page, it states the site has detected you as a bot.
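To see the block happen, check the status and redirect target on the response (a sketch; res is the variable from getHtml above):
if (res.StatusCode != HttpStatusCode.OK)
{
    // e.g. 416 plus a Location header pointing at the bot-detection page
    Console.WriteLine("Status: {0}", (int)res.StatusCode);
    Console.WriteLine("Location: {0}", res.Headers["Location"]);
}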
I have no problem accessing the website with a browser, but when I programmatically try to access the website for scraping, I get the following error.
The remote server returned an error: (500) Internal Server Error.
Here is the code I'm using.
using System.Net;
using System.IO; // for StreamReader

string strURL1 = "http://www.covers.com/index.aspx";
WebRequest req = WebRequest.Create(strURL1);
// Get the stream from the returned web response
StreamReader stream = new StreamReader(req.GetResponse().GetResponseStream());
System.Text.StringBuilder sb = new System.Text.StringBuilder();
string strLine;
// Read the stream a line at a time and append each non-empty line to the builder
while ((strLine = stream.ReadLine()) != null)
{
    if (strLine.Length > 0)
        sb.Append(strLine + Environment.NewLine);
}
stream.Close();
This one has me stumped. TIA
It's the user agent.
Many sites like the one you're attempting to scrape will validate the user agent string in an attempt to stop you from scraping them. As it has with you, this quickly stops junior programmers from attempting the scrape. It's not really a very solid way of stopping a scrape, but it stumps some people.
Setting the User-Agent string will work. Change the code to:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(strURL1);
req.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"; // Chrome user agent string
...and it will be fine.
It looks like it's doing some sort of user-agent checking. I was able to replicate your problem in PowerShell, but I noticed that the PowerShell cmdlet Invoke-WebRequest was working fine.
So I hooked up Fiddler, reran it, and stole the user-agent string out of Fiddler.
Try to set the UserAgent property to:
User-Agent: Mozilla/5.0 (Windows NT; Windows NT 6.2; en-US) WindowsPowerShell/4.0
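In code, that is just (assuming the HttpWebRequest cast shown in the other answer):
req.UserAgent = "Mozilla/5.0 (Windows NT; Windows NT 6.2; en-US) WindowsPowerShell/4.0";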
I'm trying to check the redirect location of a url, but I always get the wrong result. For example, the url http://www.yellowpages.com.eg/Mjg3NF9VUkxfMTEwX2h0dHA6Ly93d3cubG90dXMtYWlyLmNvbV8=/Lotus-Air/profile.html redirects to http://www.lotus-air.com with a 302 Found (you can test it with this service: http://www.internetofficer.com/seo-tool/redirect-check/). However, I get "http://mobile.yellowpages.com.eg/" as the webResp.GetResponseHeader("Location"). My code is as follows:
string url = @"http://www.yellowpages.com.eg/Mjg3NF9VUkxfMTEwX2h0dHA6Ly93d3cubG90dXMtYWlyLmNvbV8=/Lotus-Air/profile.html";
HttpWebRequest webReq = WebRequest.Create(url) as HttpWebRequest;
webReq.Method = "HEAD";
webReq.AllowAutoRedirect = false;
HttpWebResponse webResp = webReq.GetResponse() as HttpWebResponse;
txtOutput.Text += webResp.StatusCode.ToString() + "\r\n" ;
txtOutput.Text += webResp.GetResponseHeader("Location") + "\r\n";
txtOutput.Text += webResp.ResponseUri.ToString();
webResp.Close();
Thanks, Yehia
They are probably sending different redirects based on the user agent, so you get one result in a browser and another in your code.
You could use an HTTP debugging proxy to get an understanding of the headers moving back and forth; it also enables you to change your user-agent to help test Ben's theory (I +1'd that).
A good one is Fiddler - Web Debugging Proxy: free and easy to use.
For example, changing the user agent to an old IEMobile one, "Mozilla/4.0 (compatible; MSIE 6.0; Windows CE; IEMobile 6.12; en-US; KIN.Two 1.0)", redirects me to mobile.yellowpages.com.eg.
N.B. changing to an iPad user agent takes you to iphone.yellowpages.com.eg.
As Ben pointed out, it redirects based on user agent. Just add a user agent (this one is for Chrome):
webReq.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13";
For me it redirects to http://www.lotus-air.com.
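Putting it together, here is a sketch of the original code with only the user agent line added:
string url = @"http://www.yellowpages.com.eg/Mjg3NF9VUkxfMTEwX2h0dHA6Ly93d3cubG90dXMtYWlyLmNvbV8=/Lotus-Air/profile.html";
HttpWebRequest webReq = (HttpWebRequest)WebRequest.Create(url);
webReq.Method = "HEAD";
webReq.AllowAutoRedirect = false;
webReq.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13";
using (HttpWebResponse webResp = (HttpWebResponse)webReq.GetResponse())
{
    Console.WriteLine(webResp.StatusCode);                    // 302 Found
    Console.WriteLine(webResp.GetResponseHeader("Location")); // http://www.lotus-air.com
}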
I am trying to log into a website to send SMS via a Windows Phone 7 app. I have 2 providers working, but when I try Vodafone I run into an error.
From what I gather it seems that the response does not contain cookies, or they are not being read. The request logs in ok and the response I get back is the correct page but it contains no cookies.
The Url:
RequestUrl = String.Format("https://www.vodafone.ie/myv/services/login/Login.shtml?username={0}&password={1}", userSettings.Username, userSettings.Password),
The Request:
Request = (HttpWebRequest)WebRequest.Create(requestCollection.CurrentRequest().RequestUrl);
if (Request.CookieContainer == null)
{
    Request.CookieContainer = cookieJar.CookieContainer;
    Request.AllowAutoRedirect = true;
    Request.AllowReadStreamBuffering = true;
    Request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6";
}
Here is where the code errors, because the response cookies cannot be evaluated:
public void AddCookiesToContainer(HttpWebResponse response)
{
    CookieCollection.Add(response.Cookies);
    CookieContainer.Add(response.ResponseUri, CookieCollection);
}
And the debugger shows no cookies :(
Which line of the code has the error?
Have you verified that the service does return cookies? (i.e. If you make the same request from a PC)
Edit:
The remote host is returning cookies in its redirection to the index page, but on that page there are no cookies in the response. This would explain why there are no cookies in the collection when you try to use it.
Verify this behaviour against a PC client, inspect the body of the response from index.jsp (it may contain information to help debug), and check the documentation on how the process is supposed to work.
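One more thing worth trying (a sketch; on the desktop stack, cookies set during an automatic redirect accumulate in the container attached to the request even when the final response's Cookies collection is empty, and it's worth verifying the phone stack behaves the same): read them back out of the container rather than out of the response:
// Cookies from intermediate redirect responses live in the container, not the final response
CookieCollection stored = Request.CookieContainer.GetCookies(new Uri("https://www.vodafone.ie/"));
foreach (Cookie c in stored)
{
    System.Diagnostics.Debug.WriteLine(c.Name + "=" + c.Value);
}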