Getting 403 error while scraping a website page for data - c#

I have a price comparison website that scrapes prices from various websites. The code works fine for all of them except one, which returns a 403 Forbidden error. That website is built on the ASP.NET MVC 3 framework. The following is my code.
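public static decimal? GetSpanFromWebSite(string url, string identification)
{
    var baseUrl = new Uri(url);
    HtmlAgilityPack.HtmlDocument document = new HtmlDocument();
    try
    {
        // Dispose the client and the response stream once the page is loaded.
        using (WebClient client = new WebClient())
        using (var stream = client.OpenRead(baseUrl))
        {
            document.Load(stream);
        }
        // SelectSingleNode returns null when the XPath matches nothing.
        var div = document.DocumentNode.SelectSingleNode(identification);
        return div == null ? (decimal?)null : Convert.ToDecimal(div.InnerHtml);
    }
    catch (Exception)
    {
        // Any download or parse failure (including the 403) ends up here.
        return null;
    }
}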
What is the workaround, and how can I continue scraping the website?

It is likely a scraping countermeasure implemented by the site.
Try to mimic the browser request as closely as possible, especially the headers (User-Agent, Referer, Content-Type, etc.).
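As a minimal sketch of the advice above, the request could be built with HttpWebRequest instead of WebClient so that the browser-style headers are set explicitly. The class name, method names, and every header value below are illustrative assumptions, not values known to satisfy the site in question; the real ones should be copied from a request captured in a browser.

```csharp
using System;
using System.IO;
using System.Net;

public static class BrowserLikeFetcher
{
    // Builds a GET request whose headers resemble those of a desktop browser.
    // All header values are assumptions; replace them with the ones your
    // browser actually sends to the target site.
    public static HttpWebRequest CreateRequest(string url, string referer)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.Referer = referer;
        request.Headers[HttpRequestHeader.AcceptLanguage] = "en-US,en;q=0.9";
        // Keep any cookies the server sets so they are sent on redirects.
        request.CookieContainer = new CookieContainer();
        // Let the framework decompress gzip/deflate responses transparently.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }

    // Sends the request and returns the response body as a string.
    public static string DownloadHtml(string url, string referer)
    {
        var request = CreateRequest(url, referer);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The returned HTML can then be passed to HtmlAgilityPack with document.LoadHtml(html) in place of document.Load(client.OpenRead(baseUrl)). If the site still answers 403, compare the outgoing request with the browser's in a debugging proxy and add whichever headers or cookies differ.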

403 Forbidden
The server understood your request but is refusing to fulfill it, so check your HTTP request headers and cookie values.
You can use a web debugging tool such as Fiddler (http://www.telerik.com/fiddler/web-debugging)
to inspect the request and response.

Related

Images from http site on https site: mixed mode

My https-based site (site A) uses images from http-based site B. This causes a mixed-content error. To fix it, I replaced each external link such as http://www.siteB.com/imageX.png with a link to my own controller method, which forwards to the external image. The new link format is:
The code of the /api/misc/forward method is as follows:
[HttpGet]
public async Task<HttpResponseMessage> Forward(string url)
{
    HttpResponseMessage httpResponseMessage = new HttpResponseMessage();
    try
    {
        var response = Request.CreateResponse(HttpStatusCode.Found);
        response.Headers.Location = new Uri(HttpUtility.UrlDecode(url));
        return response;
    }
    catch (Exception ex)
    {
        httpResponseMessage.StatusCode = HttpStatusCode.NotFound;
        _loggerService.LogException(ex, url);
    }
    return httpResponseMessage;
}
But the browser still recognizes it as mixed mode. Why?
The original image links sent to the browser originate from the https-based site.
Any quick tip for it? I don't want to cache all images from site B :).

Because your code sends back a redirect to another location, the browser eventually still goes to the HTTP image.
What happens is that your browser calls the controller over HTTPS, the controller action sends back a redirect to the browser, and the browser then retrieves the image from the new location that you set in response.Headers.Location.
If you want to avoid mixed mode, you need to retrieve the image in the controller and return a FileResult from the action. This way, the browser does not have to access the HTTP site.
Another approach would be to just copy the images to your site.
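A rough sketch of the "retrieve the image from the controller" approach follows. Because the Forward action in the question is a Web API action that returns HttpResponseMessage, this sketch returns the image bytes in an HttpResponseMessage rather than an MVC FileResult. The ImageProxy class, its method names, and the allow-list of hosts are hypothetical; the allow-list is an added precaution so that the endpoint does not become an open proxy.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class ImageProxy
{
    // One shared client for all upstream requests.
    private static readonly HttpClient Client = new HttpClient();

    // Assumption: only images hosted on site B should be proxied.
    private static readonly string[] AllowedHosts = { "www.siteB.com" };

    // True when the URL is an absolute http(s) URL on an allowed host.
    public static bool IsAllowed(string url)
    {
        Uri uri;
        return Uri.TryCreate(url, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
            && AllowedHosts.Contains(uri.Host, StringComparer.OrdinalIgnoreCase);
    }

    // Downloads the image server-side and wraps the bytes in a response,
    // so the browser only ever talks to the HTTPS site.
    public static async Task<HttpResponseMessage> FetchAsync(string url)
    {
        if (!IsAllowed(url))
            return new HttpResponseMessage(HttpStatusCode.BadRequest);

        using (var upstream = await Client.GetAsync(url))
        {
            if (!upstream.IsSuccessStatusCode)
                return new HttpResponseMessage(HttpStatusCode.NotFound);

            var bytes = await upstream.Content.ReadAsByteArrayAsync();
            var type = upstream.Content.Headers.ContentType;
            var result = new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = new ByteArrayContent(bytes)
            };
            // Pass the upstream media type through, with a generic fallback.
            result.Content.Headers.ContentType = new MediaTypeHeaderValue(
                type != null ? type.MediaType : "application/octet-stream");
            return result;
        }
    }
}
```

With a helper like this, the body of the Forward action could shrink to return await ImageProxy.FetchAsync(HttpUtility.UrlDecode(url));. Every image request then costs a server-side download, so some form of caching may still be worth considering for busy pages.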

Invalid PUT method from Webforms to Web API 2 (Azure)

I have a Web API on my Azure server, and I'm making calls to it from an ASP.NET Web Forms website.
I seem to be able to perform GET with no trouble. For PUT, however, it gives me this error:
The page you are looking for cannot be displayed because an invalid
method (HTTP verb) is being used
I was not able to DELETE either. I have seen other topics where people disable WebDAV and similar features on their IIS servers and it works. But on Azure?
Below is my code for the PUT:
HttpResponseMessage response = client.GetAsync("api/People/" + id).Result;
if (response.IsSuccessStatusCode)
{
    var yourcustomobjects = response.Content.ReadAsAsync<People>().Result;
    Uri peopleUrl = response.Headers.Location;
    yourcustomobjects.name = "Bob";
    response = await client.PutAsJsonAsync(peopleUrl, yourcustomobjects);
    tbDebug.Text += await response.Content.ReadAsStringAsync();
}

Alright, I grew tired of trying to fix this issue by enabling PUT.
So what I did was write a GET that makes the needed change in the database.
Cheers

Why does my WebClient return a 404 error most of the time, but not always?

I want to get information about a Microsoft Update in my program. However, the server returns a 404 error about 80% of the time. I boiled the problematic code down to this console application:
using System;
using System.Net;
namespace WebBug
{
    class Program
    {
        static void Main(string[] args)
        {
            while (true)
            {
                try
                {
                    WebClient client = new WebClient();
                    Console.WriteLine(client.DownloadString("https://support.microsoft.com/api/content/kb/3068708"));
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.Message);
                }
                Console.ReadKey();
            }
        }
    }
}
When I run the code, I have to get through the loop a few times until I get an actual response:
The remote server returned an error: (404) Not found.
The remote server returned an error: (404) Not found.
The remote server returned an error: (404) Not found.
<div kb-title title="Update for customer experience and diagnostic telemetry [...]
I can open and force-refresh (Ctrl + F5) the link in my browser as often as I want, and it always displays fine.
The problem occurs on two different machines with two different internet connections.
I've also tested this case using the Html Agility Pack, but with the same result.
The problem does not occur with other websites. (The root https://support.microsoft.com works fine 100 % of the time)
Why do I get this weird result?

Cookies. It's because of cookies.
As I started to dig into this problem, I noticed that the first time I opened the site in a new browser I got a 404, but after refreshing (sometimes once, sometimes a few times) the site continued to work.
That's when I busted out Chrome's Incognito mode and the developer tools.
There wasn't anything too fishy with the network: there was a simple redirect to the https version if you loaded http.
But what I did notice was that the cookies changed. This is what I saw the first time I loaded the page:
and here's the page after one (or a few) refreshes:
Notice how a few more cookie entries got added? The site must be trying to read those, not finding them, and "blocking" you. This might be a bot-prevention device or bad programming, I'm not sure.
Anyway, here's how to make your code work. This example uses HttpWebRequest/HttpWebResponse, not WebClient.
// Requires: using System.IO; using System.Net;
string url = "https://support.microsoft.com/api/content/kb/3068708";
// This holds all the cookies we need to add.
// Notice the values match the ones in the screenshot above.
CookieContainer cookieJar = new CookieContainer();
cookieJar.Add(new Cookie("SMCsiteDir", "ltr", "/", ".support.microsoft.com"));
cookieJar.Add(new Cookie("SMCsiteLang", "en-US", "/", ".support.microsoft.com"));
cookieJar.Add(new Cookie("smc_f", "upr", "/", ".support.microsoft.com"));
cookieJar.Add(new Cookie("smcexpsessionticket", "100", "/", ".microsoft.com"));
cookieJar.Add(new Cookie("smcexpticket", "100", "/", ".microsoft.com"));
cookieJar.Add(new Cookie("smcflighting", "wwp", "/", ".microsoft.com"));
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
// Attach the cookie container.
request.CookieContainer = cookieJar;
// Now go to the internet and fetch the contents.
// The using blocks dispose the response and the reader when done.
string site;
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
    site = sr.ReadToEnd();
}
If you remove the line request.CookieContainer = cookieJar; the request fails with a 404, which reproduces your issue.
Most of the legwork for the code example came from this post and this post.

400 bad request invalid hostname only in android application

I have an ASP.NET MVC website deployed on a server, providing a few web interfaces to others. For example, to get the current user's information, my test C# console application looks like this:
using (var client = new WebClient())
{
    try
    {
        var url = "http://api.fake.mysite.com/v1.0/user/current";
        var token = "e0034e1c082de62b74e361b15f9c6471";
        var encoded = Convert.ToBase64String(Encoding.UTF8.GetBytes(token));
        client.Headers["Authorization"] = encoded;
        client.Headers["Content-Type"] = "application/json";
        Console.WriteLine(client.DownloadString(url));
    }
    catch (WebException e)
    {
        //log the exception
    }
}
You can see the usage is pretty simple: request the URL via HTTP GET and set the Authorization header to the encoded token. It works fine on my machine. But someone else hits a strange issue when requesting this URL from an Android application. Here is the Java code:
HttpClient httpClient = new DefaultHttpClient();
String token = "e0034e1c082de62b74e361b15f9c6471";
String url = "http://api.fake.mysite.com/v1.0/user/current";
HttpGet httpGet = new HttpGet(url);
String encoded = Base64.encodeToString(token.getBytes(), Base64.DEFAULT);
httpGet.addHeader("Authorization", encoded);
httpGet.addHeader("Content-Type", "application/json");
try {
    HttpResponse httpResponse = httpClient.execute(httpGet);
    int responseCode = httpResponse.getStatusLine().getStatusCode();
    String response = EntityUtils.toString(httpResponse.getEntity());
} catch (ClientProtocolException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
He then got a "400 bad request invalid host name" error. I've tried the following:
(1) Made sure the variable "encoded" has the same value in the C# and Java code.
(2) Made sure the website's domain name is correctly set in the server's IIS.
(3) All PCs and mobile phones can visit the test index page (http://api.fake.mysite.com).
(4) ping api.fake.mysite.com works fine.
(5) If httpGet.addHeader("Authorization", encoded); is removed, the Java program gets a 401 Unauthorized result as expected (the server code under my control returns that result).
(6) Other applications written in C# and PHP can use the web methods without problems; only the Android application can't (tested on two totally different Android phones, and the Android emulator gets the 400 invalid host name error too).
(7) Using the IP instead of the domain name, http://xx.xx.xx.xx/v1.0/user/current, gives the same result (xx.xx.xx.xx stands for the IP address).
(8) Checked the IIS log: all requests to /v1.0/user/current return 200/401/500, with no 400 results.
(9) Made sure the Android application has internet permissions (we've actually added all permissions).
Does anyone know the reason, or can anyone help me find it? Thank you very much, this issue is driving me crazy.

It should be httpGet.addHeader("Authorization", "basic " + encoded); and String encoded = Base64.encodeToString(token.getBytes(), Base64.NO_WRAP);
Base64.DEFAULT appends a line terminator to the encoded string, and a newline inside a header value can corrupt the request headers. Base64.NO_WRAP omits all line terminators.

I struggled with the very same problem. I could send an HTTP POST from Fiddler or any other tool to my ASP.NET Web API in debug mode, but I could not access it from my Android application.
First, I made sure I could connect to the Web API from my computer's browser.
Then I made sure I could connect from the Android emulator's web browser (AEWB). To do that, I deployed my Web API to IIS so that it had a fixed address I could reach from the AEWB.
I could access this address from the AEWB:
http://10.0.0.2:8088/api/tran
http://10.0.0.2 -> this is your local host address as seen from Android
8088 -> this is the port of the Web API hosted on IIS
/api -> this is the Web API
/tran -> this is your controller

HTTPS C# Post?

I am trying to log in to an HTTPS website and then navigate to download a report (it's an XML report) using C#.
I have managed to log in OK via cookies, headers, etc., but whenever I navigate to the link once logged in, my connection takes me to the "logged out" page.
Does anyone know what would cause this?

Make sure the CookieContainer you use for your login is the same one you use when downloading the actual report.
var cookies = new CookieContainer();
var wr1 = (HttpWebRequest) HttpWebRequest.Create(url1);
wr1.CookieContainer = cookies;
// do login here with wr1
var wr2 = (HttpWebRequest) HttpWebRequest.Create(url2);
wr2.CookieContainer = cookies;
// get the report with wr2

It could be for any number of reasons. Did you pass the cookie to the download request? Did you pass a referrer URL?
The best way to check is to record a working HTTP request with Wireshark, Fiddler, or one of the many Firefox extensions.
Then try to recreate the request in C#.
