How to crawl a website that uses cookies while integrating IP proxy? - c#

I'm creating a crawler which uses several IP proxies. When I crawl the website without a proxy, I'm able to get the HTML source, but when I enable the IP proxy it always fails and throws an exception (The remote server returned an error: (403) Forbidden.).
Looking at Fiddler, it seems the website stores cookies upon visit. But with the proxy enabled, the request fails at the GetResponse call.
I don't understand why the cookies are not set when using a proxy. Is it the proxy server's cookie handling that causes this, or is there something I can do about it while still using the proxy?
I'm using C#, by the way, but the question doesn't seem language-dependent.

Another thing to consider: you received a cookie from the IP address of the non-proxied machine (which worked), then sent another request with that same cookie from a different IP address, which might have gotten you blocked.
Some network-level software looks at exactly this kind of thing and might have flagged you as a malicious crawler or an anonymous Tor user.
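If the block really is triggered by that IP/cookie mismatch, one workaround is to acquire the cookie through the same proxy you crawl with, so the site only ever sees one address per session. A minimal sketch, assuming HttpWebRequest; the proxy address and URLs are placeholders:

    using System;
    using System.IO;
    using System.Net;

    class ProxiedCrawler
    {
        static void Main()
        {
            // Placeholder proxy and target; substitute your own.
            var proxy = new WebProxy("203.0.113.10", 8080);
            var cookies = new CookieContainer();

            // The first request already goes THROUGH the proxy, so the
            // cookie the site sets is associated with the proxy's IP.
            string html = Fetch("http://example.com/", proxy, cookies);

            // Subsequent requests reuse the same proxy and the same
            // CookieContainer, so IP and cookie stay consistent.
            string page = Fetch("http://example.com/page2", proxy, cookies);
            Console.WriteLine(page.Length);
        }

        static string Fetch(string url, IWebProxy proxy, CookieContainer cookies)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.Proxy = proxy;
            request.CookieContainer = cookies; // cookies round-trip automatically
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }

The key point is that the session never mixes IP addresses, because even the cookie-setting visit is proxied.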

Related

How do I defend against Request Header Alteration?

I am not sure what it is called, but what happens is that my POST request can be captured by a tool like Burp Suite and changed from POST to GET.
Afterwards the process still continues, but now the parameters and their values are shown in the URL.
How do I defend against this kind of attack?
The website is on ASP.NET C#.
Burp Suite is a "man in the middle" (MITM) proxy with injection/manipulation capability. If your site is on HTTP (not HTTPS), then yes: you are completely at the mercy of every intermediary the traffic passes through. Change your site immediately to use HTTPS with a valid certificate.
For this to work on HTTPS, you need to deliberately break your machine by installing a dodgy root certificate authority that will issue fake certificates for the sites it wants to MITM. This only gets past your browser's security because you broke your machine.
An attack that depends on the client already having been compromised is not an interesting attack from a server perspective. All you can reasonably do is protect intact clients, by using HTTPS and disabling HTTP (non-TLS). You can do things like reject GET when you're expecting POST (see the sketch after the notes below), but this doesn't change the fact that the GET will already have happened. But note:
the MITM proxy can already read the POST variables without needing to change them to GET: it is in complete control of the data
other intermediaries between the MITM proxy and your server cannot read the data regardless of whether it is GET or POST, as long as it is https (which is why you need to disable http, not just enable https)
the only thing you're changing with GET vs POST here is what appears in your own server logs... and it doesn't matter how you respond to the request at that point: it has already been logged, even if you return 404 or 500 or whatever
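In ASP.NET MVC, for instance, rejecting the altered verb and refusing plain HTTP are each a one-attribute change. A sketch (the controller and action names are made up):

    using System.Web.Mvc;

    public class AccountController : Controller
    {
        // [HttpPost] makes MVC return 404 for a GET aimed at this action,
        // and [RequireHttps] rejects or redirects plain-HTTP requests.
        [HttpPost]
        [RequireHttps]
        public ActionResult Login(string username, string password)
        {
            // ... authenticate ...
            return RedirectToAction("Index", "Home");
        }
    }

Note that, as the points above explain, by the time either attribute runs the parameters of a rewritten GET have already travelled in the URL; the attributes only control how your server responds.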

How to check if proxy settings in WebClient work?

I have a WebClient object with the Proxy property set.
But when using the WebClient object, the communication appears to be transparent, as if no proxy were in effect.
How do I check programmatically at runtime (for example, before downloading a file with that WebClient object) whether the proxy connection works?
If I understand your question:
You could set up a proxy server on your computer, such as CCProxy (or something similar), point your WebClient application at that proxy server, and then enable logging in CCProxy to see whether the traffic you expected is passing through.
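The setup amounts to pointing the client at the local proxy; a sketch, assuming CCProxy's usual default HTTP port of 808:

    using System;
    using System.Net;

    class ProxyLogTest
    {
        static void Main()
        {
            using (var client = new WebClient())
            {
                // 127.0.0.1:808 is CCProxy's usual default; adjust to your setup.
                client.Proxy = new WebProxy("127.0.0.1", 808);
                string page = client.DownloadString("http://example.com/");
                Console.WriteLine("Fetched {0} chars; now check the CCProxy log.", page.Length);
            }
        }
    }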
EDIT
Are you in a network that restricts internet access unless you are using a proxy server?
If your network supports it, you could look into Automatic Proxy Detection https://msdn.microsoft.com/en-us/library/fze2ytx2(v=vs.110).aspx
When automatic proxy detection is enabled, the system attempts to locate a proxy configuration script that is responsible for returning the set of proxies that can be used for the request. If the proxy configuration script is found, the script is downloaded, compiled, and run on the local computer when proxy information, the request stream, or the response is obtained for a request that uses a WebProxy instance.
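You can also ask the system proxy (including one discovered by automatic detection) which proxy, if any, would be used for a given URL. A sketch using WebRequest.GetSystemWebProxy; the destination URL is arbitrary:

    using System;
    using System.Net;

    class ProxyInspector
    {
        static void Main()
        {
            IWebProxy systemProxy = WebRequest.GetSystemWebProxy();
            Uri destination = new Uri("http://example.com/");

            // GetProxy returns the destination itself when no proxy applies.
            Uri proxyUri = systemProxy.GetProxy(destination);

            if (systemProxy.IsBypassed(destination) || proxyUri == destination)
                Console.WriteLine("No proxy in effect; the request would go direct.");
            else
                Console.WriteLine("Requests to {0} would use proxy {1}", destination, proxyUri);
        }
    }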
The reason it is hard to know whether the problem is the proxy settings is that when your app tries to connect to the internet, it cannot possibly know or guess that the URL is inaccessible because a proxy is required, so it throws a general exception. There are many reasons why a URL might not be accessible: the internet service is down, the network is misconfigured, or, as in your case, a proxy setting is required. Your app would only be guessing at which of those reasons applies.
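To address the "check at runtime" part directly: a common approach is a cheap probe request through the configured proxy before the real download, treating any WebException as "proxy not working". A sketch; the probe URL is arbitrary and the timeout is a guess:

    using System;
    using System.Net;

    static class ProxyProbe
    {
        // Returns true if a small request through the given proxy succeeds.
        public static bool ProxyWorks(IWebProxy proxy)
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
                request.Proxy = proxy;
                request.Method = "HEAD";   // cheap probe, no response body
                request.Timeout = 10000;   // fail fast; 10 s is an arbitrary choice
                using (request.GetResponse()) { }
                return true;
            }
            catch (WebException)
            {
                // Covers proxy refusal, DNS failure, timeouts, 4xx/5xx, etc.
                return false;
            }
        }
    }

Call it with the same proxy you assigned to the WebClient, e.g. ProxyProbe.ProxyWorks(client.Proxy), before starting the real download.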

Webservice taking long time to response

I have a client-server application. My server is built on PHP, MySQL and Apache, and the client is developed using C# Windows Forms. I use a SOAP/WSDL web service for client-server communication.
Recently I found a problem: when my client sends a request, the response comes after a very long time (from 3-4 minutes up to a few hours), and sometimes I never get a response. I have checked all the timeout values on the client (HttpWebRequest Timeout, ReadWriteTimeout) as well as on the server (Timeout, KeepAliveTimeout); the maximum value I have is 5 minutes (for the HttpWebRequest ReadWriteTimeout). So can anybody tell me what the problem might be? Why is it taking hours to get a response, or never getting one at all?
In my experience, problems like these come from the web-service connection being blocked by a firewall, or a wonky proxy in the way. Check that this is not the issue.
You should begin by locating the problem through narrowing down the options. Have you tried calling the web service locally on the server to see if you get the same problem? If you don't, then it is almost certainly a connection problem.
To also rule out the client having problems, try using something like http://www.soapui.org/ to call your server's web service instead.
Where are you calling the server from? Are you sure the device you call it from is not being IP-blocked, and are you sure your web service is able to access its database from where it is running?
Does the MySQL user your server API uses have access from the server's IP? MySQL users are often restricted by IP as well.
If you're running it all locally, are you sure your IIS Express settings/virtual folders are not jumbled up and the URLs are resolving incorrectly? Try recreating the virtual folder to rule this out. Even when running locally, remember to check that the MySQL user has access from your local IP.
Those are a few of the things I usually check when I have issues like yours.
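One more way to narrow it down from the C# side is to time a raw HTTP request to the endpoint, bypassing the SOAP client entirely, which helps separate a network or proxy stall from slow server processing. A sketch; the endpoint URL is a placeholder:

    using System;
    using System.Diagnostics;
    using System.Net;

    class EndpointTimer
    {
        static void Main()
        {
            var watch = Stopwatch.StartNew();
            try
            {
                // Placeholder URL: point this at your own WSDL/endpoint.
                var request = (HttpWebRequest)WebRequest.Create("http://yourserver/service.php?wsdl");
                request.Timeout = 30000; // 30 s: anything slower is already suspect
                using (var response = (HttpWebResponse)request.GetResponse())
                {
                    Console.WriteLine("Status {0} after {1} ms",
                        response.StatusCode, watch.ElapsedMilliseconds);
                }
            }
            catch (WebException ex)
            {
                Console.WriteLine("Failed after {0} ms: {1}",
                    watch.ElapsedMilliseconds, ex.Status);
            }
        }
    }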

Check proxy type

I'd like to determine whether the proxy at a given IP address is transparent or anonymous. Transparent proxies pass your real IP along to websites in headers like HTTP_X_FORWARDED_FOR or HTTP_VIA. I would like to check proxies for this, but all the solutions I found are designed to work on the server side, testing incoming connections for proxyness. My plan is to make a web request to an example page via the proxy. How do I check the headers sent by the proxy, preferably using the WebRequest class?
EDIT: So is there some free web API that will allow me to do this? I'm not keen on setting up a script on my own small server that will be bombarded with requests.
Simply put, you don't need those headers. You can check the transparency of a proxy by sending a request to any get-my-IP site: if it returns your real IP, the proxy is transparent; if not, the proxy is anonymous. So the steps are (see the sketch after this list):
send a request to any get-my-IP site without a proxy
extract the IP from the response as your local IP address
send a new request to the same get-my-IP site, this time through the proxy
extract the IP from the response and compare it with your local IP (step 2)
if (LocalIp == ProxyIp) then the proxy is transparent, else it is anonymous
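A sketch of those steps; the IP-echo URL and the proxy address are placeholders, and the check only works if the echo service reports the IP it believes is yours (some honor X-Forwarded-For, some do not):

    using System;
    using System.Net;

    class ProxyTypeCheck
    {
        static void Main()
        {
            // Any plain-text "what is my IP" endpoint works here; this URL
            // is an example, substitute whichever service you trust.
            const string ipEcho = "https://api.ipify.org";

            string localIp = Fetch(ipEcho, null);                               // steps 1-2
            string proxyIp = Fetch(ipEcho, new WebProxy("203.0.113.10", 8080)); // steps 3-4, placeholder proxy

            Console.WriteLine(localIp == proxyIp                                // step 5
                ? "Transparent proxy: the site saw your real IP."
                : "Anonymous proxy: the site saw a different IP.");
        }

        static string Fetch(string url, IWebProxy proxy)
        {
            using (var client = new WebClient())
            {
                if (proxy != null) client.Proxy = proxy;
                return client.DownloadString(url).Trim();
            }
        }
    }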
That is technically impossible, since the client only sees what the proxy returns to it; the proxy can do whatever it wants when communicating with the target server and can transform your request, and the server's answer, any way it likes...
To really know what the proxy does, you NEED to see what the server receives and sends back, without any interference from the proxy...
The reason all the solutions are server-side is that the headers you're talking about are only passed from the proxy to the server and are never returned to the client in the response.
In other words, if you plan to check the HTTP headers in the request from the proxy to the server, you either need to check them server-side (as the solutions you found do) or actively pass them back to the client in the response so it can check them.
Either way, you can't just make a request to a random page and check the headers the server gets; the server needs to be involved in some way.
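If you do control even a small server, the "pass them right back" option is only a few lines, e.g. an ASP.NET handler that echoes the proxy-revealing headers back to the caller (a sketch; the exact header names vary by proxy):

    using System.Web;

    // An ASP.NET .ashx handler that echoes the proxy-revealing headers
    // back to the caller, so the client can see what the server saw.
    public class HeaderEcho : IHttpHandler
    {
        public void ProcessRequest(HttpContext context)
        {
            context.Response.ContentType = "text/plain";
            context.Response.Write("REMOTE_ADDR: "
                + context.Request.ServerVariables["REMOTE_ADDR"] + "\n");
            context.Response.Write("X-Forwarded-For: "
                + (context.Request.Headers["X-Forwarded-For"] ?? "(none)") + "\n");
            context.Response.Write("Via: "
                + (context.Request.Headers["Via"] ?? "(none)") + "\n");
        }

        public bool IsReusable
        {
            get { return true; }
        }
    }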

HttpWebRequest losing cookies

I have a client application that is communicating with an ASP.NET web service using cookie-based authentication. (the clients call a login method which sets a persistent cookie that is then re-used across requests within the logon session).
This has been working fine for some time, but I have started getting error reports from a few machines used by a new client that seem to indicate the cookie is not being successfully round-tripped. (Login requests all succeed, but every subsequent request fails with a 302 redirect to the logon resource.)
I am using a CookieContainer which I manually attach to each HttpWebRequest I am using to ensure that the cookies are preserved across every request.
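For reference, the pattern described looks roughly like this (a sketch; the class and method names are made up):

    using System.Net;

    static class ApiClient
    {
        // One CookieContainer shared across the whole logon session.
        static readonly CookieContainer Session = new CookieContainer();

        public static HttpWebRequest Create(string url)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            // Without this line HttpWebRequest ignores Set-Cookie entirely,
            // which produces exactly the "redirected back to login" symptom.
            request.CookieContainer = Session;
            return request;
        }
    }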
Is it possible that there is some "security" software on these machines that is intercepting/blocking the cookie transmission? (I am using SSL). If so, is there anything that can be done to tell what is getting in the way?
It's highly unlikely that security software can even see inside your packets if you're using SSL. SSL data is encrypted before it ever gets into packet form; generally it is encrypted even before the send() call to the socket. Unless you have some remarkable security software that has broken SSL encryption and can look inside the packets, this shouldn't be possible.
Are the same machines failing every time? Or are some machines failing randomly at times, and others failing at other times? If it's the latter, maybe there's something going on on the server, not the clients.
