I am building a small desktop application that uses Google search to display results. The user enters a keyword and the search results are displayed in a list box. So far that part works fine.
My actual problem is that Google sometimes blocks my IP address from searching, and I have to wait a certain amount of time before searches work normally again. By the way, while Google is blocking my IP address in the application, I can still use Google in the web browser. Weird, huh?
The error message I typically get is:
The remote server returned an error: (503) Server Unavailable.
And here is the method I am using to search Google:
private string PROCESS_URL(string url)
{
    HttpWebRequest request1 = (HttpWebRequest)WebRequest.Create(url);
    request1.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11";
    request1.Proxy = new WebProxy("xx.xx.xx.xx:xx"); // << my proxy address : port

    // Dispose the response and reader so the connection is released
    using (HttpWebResponse response1 = (HttpWebResponse)request1.GetResponse())
    using (StreamReader sr1 = new StreamReader(response1.GetResponseStream(), Encoding.UTF8))
    {
        return sr1.ReadToEnd();
    }
}
Any idea what I should consider? Maybe I need to send cookies along with the request? Am I missing some header?
The idea here is to make Google think I am a human, not a bot.
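For what it's worth, one thing that tends to help with 503s like this is not retrying immediately: rate limiting is triggered by request frequency, so spacing out searches and backing off after each failure reduces the chance of tripping it. A minimal sketch of that idea, assuming a fetch delegate wrapping a `PROCESS_URL`-style call; the class name and the delay values are my own, not anything from the question:

```csharp
using System;
using System.Net;
using System.Threading;

static class SearchRetry
{
    // Exponential backoff: 2s, 4s, 8s, ... per failed attempt.
    public static TimeSpan Backoff(int attempt) =>
        TimeSpan.FromSeconds(Math.Pow(2, attempt + 1));

    // Wraps a fetch (e.g. () => PROCESS_URL(url)) and retries on 503,
    // waiting longer after each failure instead of hammering the server.
    public static string FetchWithRetry(Func<string> fetch, int maxAttempts = 4)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return fetch();
            }
            catch (WebException ex) when (
                attempt + 1 < maxAttempts &&
                (ex.Response as HttpWebResponse)?.StatusCode == HttpStatusCode.ServiceUnavailable)
            {
                // 503: back off before the next attempt.
                Thread.Sleep(Backoff(attempt));
            }
        }
    }
}
```

This won't make Google think you're human, but it keeps the client from making the block worse while it is in effect.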
I'm writing an interface to scrape info from a service. The pages are behind a login, so I keep a copy of the cookies and then attempt to loop through the pages to get stats for our users.
The URLs to hit are of the format https://domain.com/groups/members/1234 for the first page, and each subsequent page appends ?page=X.
string vUrl = "https://domain.com/groups/members/1234";
if (pageNumber > 1)
    vUrl += "?page=" + pageNumber;

HttpWebRequest groupsRequest = (HttpWebRequest)WebRequest.Create(vUrl);
groupsRequest.CookieContainer = new CookieContainer();
groupsRequest.CookieContainer.Add(cookies); // cookies recovered from the first (login) request
groupsRequest.Method = WebRequestMethods.Http.Get;
groupsRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36";
groupsRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
groupsRequest.UseDefaultCredentials = true;
groupsRequest.AutomaticDecompression = DecompressionMethods.GZip;
groupsRequest.Headers.Add("Accept-Language", "en-US,en;q=0.8");
groupsRequest.Headers.Add("Cache-Control", "max-age=0");
HttpWebResponse getResponse = (HttpWebResponse)groupsRequest.GetResponse();
This works fine for the first page and I get the data back that I need, but on each subsequent pass the query string is ignored. Debugging at the last line shows that the request's RequestUri.Query is correct, but the response's ResponseUri.Query is blank. So it has the effect of always returning page 1 data.
I've tried to mimic the request headers that I see via Inspect in Chrome, but I'm stuck. Help?
When you put the failing URL into a browser, does it work? Because it is a GET, the browser should make the same request and tell you whether it is working. If it does not work in the browser, then perhaps you are missing something aside from the query string.
If it does work, then use Fiddler to find out exactly what headers, cookies, and query string values are being sent, to make 100% sure that you are sending the correct request. It could be that the query string alone is not enough information to get the data that you need.
If you still can't get it, capture the request in Fiddler when you send it through the browser, then use a Fiddler extension to turn the request into code and see what's up.
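A blank ResponseUri.Query from code usually means the server redirected and the redirect target dropped the query string. One way to confirm that is to turn off AllowAutoRedirect and follow the hops by hand, printing each one, so you can see exactly where ?page=X disappears. A rough sketch of that idea; the class and method names are my own, and the URL is the placeholder from the question:

```csharp
using System;
using System.Net;

static class RedirectTrace
{
    // Resolve a Location header (absolute or relative) against the current URL.
    public static string ResolveLocation(string currentUrl, string location) =>
        new Uri(new Uri(currentUrl), location).ToString();

    // Follow redirects manually, printing each hop, so you can see
    // which hop drops the query string.
    public static string Trace(string url, CookieCollection cookies)
    {
        for (int hop = 0; hop < 10; hop++)
        {
            var req = (HttpWebRequest)WebRequest.Create(url);
            req.AllowAutoRedirect = false;          // stop after one hop
            req.CookieContainer = new CookieContainer();
            req.CookieContainer.Add(cookies);

            using (var resp = (HttpWebResponse)req.GetResponse())
            {
                int code = (int)resp.StatusCode;
                Console.WriteLine("{0} -> {1}", url, code);
                if (code < 300 || code >= 400)
                    return url;                     // not a redirect: done
                url = ResolveLocation(url, resp.GetResponseHeader("Location"));
            }
        }
        return url;
    }
}
```

If the trace shows a 302 from the ?page=2 URL back to the bare URL, the server itself is discarding the parameter, likely because it expects a cookie or header you aren't sending.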
I have no problem accessing the website with a browser, but when I programmatically try to access the website for scraping, I get the following error.
The remote server returned an error: (500) Internal Server Error.
Here is the code I'm using.
using System.IO;
using System.Net;
using System.Text;

string strURL1 = "http://www.covers.com/index.aspx";
WebRequest req = WebRequest.Create(strURL1);
StringBuilder sb = new StringBuilder();

// Get the stream from the returned web response and read it a line at a time
using (StreamReader stream = new StreamReader(req.GetResponse().GetResponseStream()))
{
    string strLine;
    while ((strLine = stream.ReadLine()) != null)
    {
        if (strLine.Length > 0)
            sb.Append(strLine + Environment.NewLine);
    }
}
This one has me stumped. TIA
It's the user agent.
Many sites like the one you're attempting to scrape validate the user-agent string in an attempt to stop you from scraping them. As it has with you, this quickly stops junior programmers from attempting the scrape. It's not really a very solid way of stopping a scrape, but it stumps some people.
Setting the User-Agent string will work. Change the code to:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(strURL1);
req.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"; // Chrome user agent string
...and it will be fine.
It looks like it's doing some sort of user-agent checking. I was able to replicate your problem in PowerShell, but I noticed that the PowerShell cmdlet Invoke-WebRequest was working fine. So I hooked up Fiddler, reran it, and took the user-agent string from Fiddler.
Try setting the UserAgent property to:
Mozilla/5.0 (Windows NT; Windows NT 6.2; en-US) WindowsPowerShell/4.0
I'm using MonoTouch to develop my iOS application for iOS 6+. The basis of the application is downloading data from servers that the user enters. These servers may work with http or https. I use the code below for downloading:
var _HttpWebRequest = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(Url);
_HttpWebRequest.AllowWriteStreamBuffering = true;
_HttpWebRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
_HttpWebRequest.Timeout = 15000;
_HttpWebRequest.Method = "GET";

System.Net.WebResponse _WebResponse = _HttpWebRequest.GetResponse();
System.IO.Stream _WebStream = _WebResponse.GetResponseStream();
var bytes = HelperMethods.ReadToEnd(_WebStream);
_WebResponse.Close();
return System.Text.Encoding.UTF8.GetString(bytes);
It works when the servers use http, but when they use https I have to add https:// before the host name for it to work. So how can I detect whether a host works with https or http before sending the request?
You cannot; and there is nothing whatsoever that says http://example.com and https://example.com need to represent the same information, although by convention they almost always do. If you are content to assume the two are equivalent, you'll just need to try both (or at least decide which to try first).
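If you do make that assumption, the try-both approach can be sketched roughly like this. The class and method names are my own, and the HEAD request and 5-second timeout are arbitrary choices, not anything prescribed:

```csharp
using System;
using System.Net;

static class SchemeProbe
{
    // Prepend a scheme only if the user typed a bare host name.
    public static string WithScheme(string host, string scheme) =>
        host.Contains("://") ? host : scheme + "://" + host;

    // Try https first, falling back to http if the https attempt fails.
    // Assumes http://host and https://host serve the same content.
    public static string ProbeUrl(string host)
    {
        foreach (string scheme in new[] { "https", "http" })
        {
            string url = WithScheme(host, scheme);
            try
            {
                var req = (HttpWebRequest)WebRequest.Create(url);
                req.Method = "HEAD";          // cheap probe: headers only
                req.Timeout = 5000;
                using (req.GetResponse()) { }
                return url;                   // reachable over this scheme
            }
            catch (WebException)
            {
                // Try the next scheme.
            }
        }
        throw new WebException("Host unreachable over https or http: " + host);
    }
}
```

Note that some servers reject HEAD; if that happens you can probe with a GET instead and simply discard the body.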
I'm trying to check the redirect location of a URL but am always getting the wrong result. For example, the URL http://www.yellowpages.com.eg/Mjg3NF9VUkxfMTEwX2h0dHA6Ly93d3cubG90dXMtYWlyLmNvbV8=/Lotus-Air/profile.html redirects to http://www.lotus-air.com with a 302 Found redirect (you can test it with this service: http://www.internetofficer.com/seo-tool/redirect-check/). However, I'm getting "http://mobile.yellowpages.com.eg/" as webResp.GetResponseHeader("Location"). My code is as follows:
string url = @"http://www.yellowpages.com.eg/Mjg3NF9VUkxfMTEwX2h0dHA6Ly93d3cubG90dXMtYWlyLmNvbV8=/Lotus-Air/profile.html";
HttpWebRequest webReq = WebRequest.Create(url) as HttpWebRequest;
webReq.Method = "HEAD";
webReq.AllowAutoRedirect = false;
HttpWebResponse webResp = webReq.GetResponse() as HttpWebResponse;
txtOutput.Text += webResp.StatusCode.ToString() + "\r\n";
txtOutput.Text += webResp.GetResponseHeader("Location") + "\r\n";
txtOutput.Text += webResp.ResponseUri.ToString();
webResp.Close();
Thanks.
Yehia
They are probably sending different redirects based on the user agent, so you get one result in a browser and another in your code.
You could use an HTTP debugging proxy to understand the headers moving back and forth; it also lets you change your user agent to help test Ben's theory (I +1'd that). A good one is Fiddler - Web Debugging Proxy, free and easy to use.
For example, changing the user agent to an old IEMobile one, "Mozilla/4.0 (compatible; MSIE 6.0; Windows CE; IEMobile 6.12; en-US; KIN.Two 1.0)", redirects me to mobile.yellowpages.com.eg.
N.b. changing to an iPad user agent takes you to iphone.yellowpages.com.eg.
As Ben pointed out, it redirects based on user agent. Just add some user agent (this one is for chrome):
webReq.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13";
For me it redirects to http://www.lotus-air.com.
When I try to load HTML from the server over https, it returns error code 500, but when I open the same link in a browser it works fine. Is there any way to do this? I'm using WebClient and also sending user-agent information to the server:
HttpWebRequest req1 = (HttpWebRequest)WebRequest.Create("https://mobile.unibet.com/");
req1.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
var response1 = req1.GetResponse();
var responsestream1 = response1.GetResponseStream();
David is correct: this generally happens when the server expects headers that are not being passed through, in your case Accept.
This code works now:
string requestUrl = "https://mobile.unibet.com/unibet_index.t";
var request = (HttpWebRequest)WebRequest.Create(requestUrl);
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
using (var response = request.GetResponse() as HttpWebResponse)
{
    using (var sr = new StreamReader(response.GetResponseStream()))
    {
        var responsestring = sr.ReadToEnd();
        if (!string.IsNullOrEmpty(responsestring))
        {
            Console.WriteLine(responsestring);
        }
    }
}
This should probably be a comment, but there's not enough room in a comment for all the questions. I don't think the question has enough information to answer with any level of confidence.
A 500 error means a problem at the server. The short answer is that the browser is sending some content that the WebClient is not.
The WebClient may not be sending headers that are expected by the server. Does the server require authentication? Is this a page from a company that you've contracted with, which perhaps provided you with credentials or an API key? Do you need to add an HTTP Authorization header?
If this is something you're doing with a company that you've got a partnership with, you should be able to ask them to help trace why you're getting a 500 error. Otherwise, you may need to provide us with a code sample and more details so we can offer more suggestions.
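When tracing a 500 like this, it also helps to read the body of the error response: the server often returns an HTML page describing what it didn't like, but GetResponse() throws before you ever see it. A sketch of pulling the body out of the WebException; the class and method names are my own:

```csharp
using System;
using System.IO;
using System.Net;

static class ErrorBodyReader
{
    // Fetch a URL; on an HTTP error (e.g. 500), return the error page body
    // instead of losing it to the exception.
    public static string FetchOrErrorBody(string url)
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        try
        {
            using (var resp = (HttpWebResponse)req.GetResponse())
            using (var sr = new StreamReader(resp.GetResponseStream()))
                return sr.ReadToEnd();
        }
        catch (WebException ex) when (ex.Response != null)
        {
            // A 4xx/5xx response still carries headers and a body.
            using (var resp = (HttpWebResponse)ex.Response)
            using (var sr = new StreamReader(resp.GetResponseStream()))
            {
                Console.WriteLine("HTTP {0}: {1}", (int)resp.StatusCode, resp.StatusDescription);
                return sr.ReadToEnd();
            }
        }
    }
}
```

Whatever the error page says (a missing header, an authentication prompt, a stack trace) usually points directly at what the browser is sending that your code is not.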