Error 500: WebRequest can't scrape website - C#

I have no problem accessing the website with a browser, but when I programmatically try to access the website for scraping, I get the following error.
The remote server returned an error: (500) Internal Server Error.
Here is the code I'm using.
using System;
using System.IO;
using System.Net;

string strURL1 = "http://www.covers.com/index.aspx";
WebRequest req = WebRequest.Create(strURL1);

// Get the stream from the returned web response
StreamReader stream = new StreamReader(req.GetResponse().GetResponseStream());
System.Text.StringBuilder sb = new System.Text.StringBuilder();
string strLine;

// Read the stream a line at a time and append each non-empty line
while ((strLine = stream.ReadLine()) != null)
{
    if (strLine.Length > 0)
        sb.Append(strLine + Environment.NewLine);
}
stream.Close();
This one has me stumped. TIA

It's the user agent.
Many sites like the one you're attempting to scrape validate the user agent string in an attempt to stop you from scraping them. As it has with you, this quickly stops junior programmers from attempting the scrape. It's not really a very solid way of stopping a scrape, but it stumps some people.
Setting the User-Agent string will work. Change the code to:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(strURL1);
req.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"; // Chrome user agent string
...and it will be fine.
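For completeness, here is a minimal sketch of the whole request with the User-Agent set, reusing the read loop from the question (the exact UA string doesn't matter much; any common browser string should do):
using System;
using System.IO;
using System.Net;
using System.Text;

string strURL1 = "http://www.covers.com/index.aspx";
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(strURL1);
req.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36";

StringBuilder sb = new StringBuilder();
using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
using (StreamReader reader = new StreamReader(resp.GetResponseStream()))
{
    string strLine;
    while ((strLine = reader.ReadLine()) != null)
    {
        if (strLine.Length > 0)
            sb.AppendLine(strLine);
    }
}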

It looks like it's doing some sort of user-agent checking. I was able to replicate your problem in PowerShell, but I noticed that the PowerShell cmdlet Invoke-WebRequest was working fine.
So I hooked up Fiddler, reran it, and stole the user-agent string out of Fiddler.
Try to set the UserAgent property to:
User-Agent: Mozilla/5.0 (Windows NT; Windows NT 6.2; en-US) WindowsPowerShell/4.0
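In the C# code that means setting the UserAgent property on the request (a sketch, reusing strURL1 from the question):
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(strURL1);
req.UserAgent = "Mozilla/5.0 (Windows NT; Windows NT 6.2; en-US) WindowsPowerShell/4.0";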

Related

Querystring being ignored

I'm writing an interface to scrape info from a service. The link is behind a login, so I keep a copy of the cookies and then attempt to loop through the pages to get stats for our users.
The URLs to hit are of the format: https://domain.com/groups/members/1234
for the first page, and each subsequent page appends ?page=X
string vUrl = "https://domain.com/groups/members/1234";
if (pageNumber > 1) vUrl += "?page=" + (pageNumber).ToString();
HttpWebRequest groupsRequest = (HttpWebRequest)WebRequest.Create(vUrl);
groupsRequest.CookieContainer = new CookieContainer();
groupsRequest.CookieContainer.Add(cookies); // cookies recovered from the first (login) request
groupsRequest.Method = WebRequestMethods.Http.Get;
groupsRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36";
groupsRequest.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
groupsRequest.UseDefaultCredentials = true;
groupsRequest.AutomaticDecompression = DecompressionMethods.GZip;
groupsRequest.Headers.Add("Accept-Language", "en-US,en;q=0.8");
groupsRequest.Headers.Add("Cache-Control", "max-age=0");
HttpWebResponse getResponse = (HttpWebResponse)groupsRequest.GetResponse();
This works fine for the first page and I get the data back that I need, but on each subsequent pass the query string is ignored. Debugging at the last line shows that RequestUri.Query for the request is correct, but the response's RequestUri.Query is blank. So it has the effect of always returning page 1 data.
I've tried to mimic the request headers that I see via Inspect in Chrome, but I'm stuck. Help?
When you put the failing URL into a browser, does it work? Because it is a GET, the browser should make the same request and tell you if it is working. If it does not work in the browser, then perhaps you are missing something aside from the query string.
If it does work, then use Fiddler and find out exactly what headers, cookies, and query string values are being sent, to make 100% sure that you are sending the correct request. It could be that the query string alone is not enough information to get the data that you need.
If you still can't get it, then capture the browser's request in Fiddler and use a Fiddler extension to turn the request into code and see what's up.
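One quick check you can do in code (a sketch, on the assumption that the server might be redirecting page 2 and dropping the query string along the way): turn off automatic redirects and look at the status code and Location header.
// Probe page 2 without following redirects; "cookies" is the collection saved from the login request.
string probeUrl = "https://domain.com/groups/members/1234?page=2";
HttpWebRequest probe = (HttpWebRequest)WebRequest.Create(probeUrl);
probe.CookieContainer = new CookieContainer();
probe.CookieContainer.Add(cookies);
probe.AllowAutoRedirect = false; // stop here so the redirect, if any, can be inspected
using (HttpWebResponse resp = (HttpWebResponse)probe.GetResponse())
{
    Console.WriteLine(resp.StatusCode);                    // 200 means no redirect happened
    Console.WriteLine(resp.GetResponseHeader("Location")); // if 3xx, check whether ?page=2 survives
}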

C# - response from application is not the same as the response from the browser

I am trying to get the Bing wallpaper data using the following request URL: http://www.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1
I have the following code:
private string getJsonData()
{
    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Clear();
        client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "application/json");
        client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Dragon/43.3.3.185 Chrome/43.0.2357.81 Safari/537.36");
        using (var response = client.GetAsync("http://www.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1").Result)
        {
            response.EnsureSuccessStatusCode();
            return response.Content.ReadAsStringAsync().Result;
        }
    }
}
The problem is that I receive a copyrightlink equal to javascript:void(0), whereas if I make the same request in a browser I get a valid URL: http://www.bing.com/search?q=Brooklyn+Heights,+New+York&form=hpcapt&filters=HpDate:%2220150906_0700%22
I have tried quite a lot of things regarding the headers sent with the request, with no success, so I suppose the problem is coming from somewhere else. Any suggestions?
Note: the same issue is present when using xml as the requested format
Thanks!
Since javascript:void(0) just means "no link", my guess is that for this specific image there simply is no copyright link (pointing to the author's web page, or something like that) at all - only the text "© Andrew C. Mace/Getty Images".
In the end I found the issue: it looks like I have to add the market/region to the request URL, like this: http://www.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1&mkt=en-US. With that parameter the copyrightlink is no longer javascript:void(0).
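In code, the only change from the snippet above is the request URL (a sketch; the headers stay exactly as they were):
// Same GetAsync call as before, with the market added via &mkt=en-US
using (var response = client.GetAsync("http://www.bing.com/HPImageArchive.aspx?format=js&idx=0&n=1&mkt=en-US").Result)
{
    response.EnsureSuccessStatusCode();
    return response.Content.ReadAsStringAsync().Result;
}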

Search Google with a proxy

I am writing a small desktop application that uses Google search and displays the results to people.
The user enters a keyword and the search results are displayed in a list box.
So far I don't have a problem with that.
My actual problem is that sometimes Google blocks my IP address from searching and I have to wait a certain amount of time before it will do the search normally again.
By the way, while Google is blocking my IP address in the software, I am still able to use Google in the web browser - weird, huh?
Typically the error message I am getting is:
The remote server returned an error: (503) Server Unavailable.
And here is the method I am using to search Google:
private string PROCESS_URL(string url)
{
    HttpWebRequest request1 = (HttpWebRequest)WebRequest.Create(url);
    request1.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11";
    request1.Proxy = new WebProxy("xx.xx.xx.xx:xx"); // << my proxy address : port
    HttpWebResponse response1 = (HttpWebResponse)request1.GetResponse();
    StreamReader sr1 = new StreamReader(response1.GetResponseStream(), Encoding.UTF8);
    string page_source1 = sr1.ReadToEnd();
    return page_source1;
}
Any idea what I should consider? Maybe I need to send cookies along with the request?
Am I missing some attribute?
Basically the idea here is to make Google think I am a human, not a bot.
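No accepted answer is shown here, but as a hedged sketch of the two things the question itself suggests - keeping cookies across searches and pacing the requests so they look less bot-like (the shared cookie container and the 2-second delay are illustrative assumptions, not a known fix):
private static readonly CookieContainer searchCookies = new CookieContainer(); // shared across searches

private string PROCESS_URL(string url)
{
    System.Threading.Thread.Sleep(2000); // crude throttle so searches are not fired back-to-back

    HttpWebRequest request1 = (HttpWebRequest)WebRequest.Create(url);
    request1.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11";
    request1.Proxy = new WebProxy("xx.xx.xx.xx:xx"); // << my proxy address : port
    request1.CookieContainer = searchCookies;        // reuse Google's cookies between requests

    using (HttpWebResponse response1 = (HttpWebResponse)request1.GetResponse())
    using (StreamReader sr1 = new StreamReader(response1.GetResponseStream(), Encoding.UTF8))
    {
        return sr1.ReadToEnd();
    }
}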

Detecting 302 Redirect

I'm trying to check the redirect location of a URL but am always getting the wrong results. For example, the URL http://www.yellowpages.com.eg/Mjg3NF9VUkxfMTEwX2h0dHA6Ly93d3cubG90dXMtYWlyLmNvbV8=/Lotus-Air/profile.html redirects to http://www.lotus-air.com with a 302 Found redirect (you can test it with this service: http://www.internetofficer.com/seo-tool/redirect-check/), yet I am getting "http://mobile.yellowpages.com.eg/" as webResp.GetResponseHeader("Location"). My code is as follows:
string url = @"http://www.yellowpages.com.eg/Mjg3NF9VUkxfMTEwX2h0dHA6Ly93d3cubG90dXMtYWlyLmNvbV8=/Lotus-Air/profile.html";
HttpWebRequest webReq = WebRequest.Create(url) as HttpWebRequest;
webReq.Method = "HEAD";
webReq.AllowAutoRedirect = false;
HttpWebResponse webResp = webReq.GetResponse() as HttpWebResponse;
txtOutput.Text += webResp.StatusCode.ToString() + "\r\n" ;
txtOutput.Text += webResp.GetResponseHeader("Location") + "\r\n";
txtOutput.Text += webResp.ResponseUri.ToString();
webResp.Close();
Thanks.
Yehia
They are probably sending different redirects based on the user agent, so you get one result in a browser and another in your code.
You could use an HTTP debugging proxy to get an understanding of the headers moving back and forth; it also enables you to change your user agent to help test Ben's theory (I +1'd that).
A good one is Fiddler - Web Debugging Proxy, free and easy to use.
For example, changing the user agent in Fiddler to an old IEMobile one, "Mozilla/4.0 (compatible; MSIE 6.0; Windows CE; IEMobile 6.12; en-US; KIN.Two 1.0)", redirects me to mobile.yellowpages.com.eg.
N.B. changing to an iPad user agent takes you to iphone.yellowpages.com.eg.
As Ben pointed out, it redirects based on the user agent. Just add some user agent (this one is for Chrome):
webReq.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13";
For me it redirects to http://www.lotus-air.com.
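Putting the two answers together, here is a sketch of the full check - the same code as the question, plus the UserAgent line from this answer:
string url = @"http://www.yellowpages.com.eg/Mjg3NF9VUkxfMTEwX2h0dHA6Ly93d3cubG90dXMtYWlyLmNvbV8=/Lotus-Air/profile.html";
HttpWebRequest webReq = (HttpWebRequest)WebRequest.Create(url);
webReq.Method = "HEAD";
webReq.AllowAutoRedirect = false; // do not follow the redirect, just report it
webReq.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13";
using (HttpWebResponse webResp = (HttpWebResponse)webReq.GetResponse())
{
    Console.WriteLine(webResp.StatusCode);                    // expect Found (302)
    Console.WriteLine(webResp.GetResponseHeader("Location")); // expect http://www.lotus-air.com
}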

C# WebClient cannot get a response over HTTPS

When I try to load HTML from the server over HTTPS, it returns error code 500, but when I open the same link in a browser it works fine. Is there any way to do this? I'm using WebClient and am also sending user agent information to the server:
HttpWebRequest req1 = (HttpWebRequest)WebRequest.Create("https://mobile.unibet.com/");
req1.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";
var response1 = req1.GetResponse();
var responsestream1 = response1.GetResponseStream();
David is correct; this generally happens when the server is expecting some headers that are not passed through, in your case Accept.
This code works now:
string requestUrl = "https://mobile.unibet.com/unibet_index.t";
var request = (HttpWebRequest)WebRequest.Create(requestUrl);
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
using (var response = request.GetResponse() as HttpWebResponse)
{
    using (var sr = new StreamReader(response.GetResponseStream()))
    {
        var responsestring = sr.ReadToEnd();
        if (!string.IsNullOrEmpty(responsestring))
        {
            Console.WriteLine(responsestring);
        }
    }
}
This should probably be a comment but there's not enough room in the comment for all the questions... I don't think the question has enough information to answer with any level of confidence.
A 500 error means a problem at the server. The short answer is that the browser is sending some content that the WebClient is not.
The WebClient may not be sending headers that are expected by the server. Does the server require authentication? Is this a page from a company that you've contracted with, one that perhaps provided you with credentials or an API key? Do you need to add an HTTP Authorization header?
If this is something you're doing with a company that you've got a partnership with, you should be able to ask them to help trace why you're getting a 500 error. Otherwise, you may need to provide us with a code sample and more details so we can offer more suggestions.
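If authentication does turn out to be the missing piece, here is a sketch of the two usual options with HttpWebRequest (the credentials and token below are placeholders, not values from the question):
var request = (HttpWebRequest)WebRequest.Create("https://mobile.unibet.com/unibet_index.t");

// Option 1: let the framework handle basic/NTLM authentication
request.Credentials = new NetworkCredential("username", "password"); // placeholder credentials

// Option 2: send an explicit Authorization header, e.g. for an API key or bearer token
request.Headers.Add("Authorization", "Bearer YOUR_TOKEN_HERE"); // placeholder token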
