Use HttpWebRequest to download web pages without case-sensitivity issues - C#

[update: I don't know why, but both examples below now work fine! Originally I was also seeing a 403 on the page2 example. Maybe it was a server issue?]
First, WebClient is easier. Actually, I've seen this before: it turned out to be case sensitivity in the URL when accessing Wikipedia, so try ensuring that you have used the same case in your request.
[updated] As Bruno Conde and gimel observe, using %27 should help make it consistent (the intermittent behaviour suggests that some Wikipedia servers are configured differently to others).
I've just checked, and in this case the case issue doesn't seem to be the problem... however, if it worked (it doesn't), this would be the easiest way to request the page:
using (WebClient wc = new WebClient())
{
    string page1 = wc.DownloadString("http://en.wikipedia.org/wiki/Algeria");
    string page2 = wc.DownloadString("http://en.wikipedia.org/wiki/%27Abadilah");
}
I'm afraid I can't think what to do about the leading apostrophe that is breaking things...

I also got strange results ... First, the
http://en.wikipedia.org/wiki/'Abadilah
didn't work and after some failed tries it started working.
The second URL,
http://en.wikipedia.org/wiki/'t_Zand_(Alphen-Chaam)
always failed for me...
The apostrophe seems to be responsible for these problems. If you replace it with
%27
both URLs work fine.

Try escaping the special characters using percent-encoding (section 2.1 of RFC 3986). For example, a single quote is represented by %27 in the URL (IRI).
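If it helps, a minimal sketch of doing that encoding in C# (my own illustration, using a title from the question; note that whether Uri.EscapeDataString escapes the apostrophe varies by framework version, so an explicit Replace is the predictable route):

using System;

class PercentEncodingExample
{
    static void Main()
    {
        // Percent-encode the apostrophe by hand before building the URL.
        string title = "'t_Zand_(Alphen-Chaam)"; // example title from the question
        string url = "http://en.wikipedia.org/wiki/" + title.Replace("'", "%27");
        Console.WriteLine(url); // http://en.wikipedia.org/wiki/%27t_Zand_(Alphen-Chaam)
    }
}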

I'm sure the OP has this sorted by now, but I've just run across the same kind of problem: intermittent 403s when downloading from Wikipedia via a WebClient. Setting a user-agent header sorts it out:
client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
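For completeness, here is that fix as a minimal end-to-end sketch (my own; the Algeria URL is the example from the question above):

using System;
using System.Net;

class UserAgentExample
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Setting a user-agent header avoids the intermittent 403s described above.
            client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
            string page = client.DownloadString("http://en.wikipedia.org/wiki/Algeria");
            Console.WriteLine(page.Length);
        }
    }
}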

C# HttpRequestValidationException

I have a C# ASP.NET app running on Amazon EC2; however, I am getting a validation error:
Exception type: HttpRequestValidationException
Exception message: A potentially dangerous Request.RawUrl value was detected from the client (="...h&content=<php>die(#md5(HelloT...").
The logs show that the request url was:
http://blah.com/?a=fetch&content=<php>die(#md5(HelloThinkCMF))</php>
Where does that PHP die script come from? Is this some kind of security breach? I have no idea how to debug this.
This is due to a built-in ASP.Net feature called "Request validation" which causes an exception to be thrown to prevent attacks whenever dangerous characters are found in e.g. the query string. In this case, it is probably caused by the < character, which is forbidden to make attacks such as Cross Site Scripting harder. As such, the error indicates that the attempt to access your site was stopped before your application code was even invoked.
The query string in your example is probably generated by some automated attack script or botnet that is throwing random data at your site to try to breach it. You can safely ignore this particular instance of the attack, since you're not running PHP. That being said, as others have commented, it does indicate that someone is trying to get in, so you should consider taking appropriate security measures either in your application code or in your network/hosting setup. What these are is both out of scope for this site and hard to say without knowing a lot more about your context, however.
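If these probes clutter your error logs, one option (a sketch of mine, not from the original answer) is to recognise the exception type in Global.asax and turn it into a bare 400 response:

using System;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_Error(object sender, EventArgs e)
    {
        // Request validation failures surface as HttpRequestValidationException.
        Exception ex = Server.GetLastError();
        if (ex is HttpRequestValidationException)
        {
            Server.ClearError();       // drop the error so it isn't logged as unhandled
            Response.Clear();
            Response.StatusCode = 400; // answer the probe with a plain 400
            CompleteRequest();         // skip the rest of the pipeline
        }
    }
}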
Those are ThinkPHP5 (a Chinese PHP framework based on Laravel) RCE exploit attempts.
This blog post suggests that this is a WordPress exploit that no longer works.
I am not running PHP (or WordPress), yet my web server (apache2, log extract) returns a 200 to this (which is why I was interested):
[04/Jun/2020:11:43:35 -0500] "GET /index.php?s=/Index/\\think\\app/invokefunction&function=call_user_func_array&vars[0]=md5&vars[1][]=HelloThinkPHP HTTP/1.1" 404 367 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"
That request came from 195.54.160.135. Jonas Høgh is correct, of course, that securing your site is something you have to figure out yourself. I have a script to block an IP on an ad hoc basis and another one to get a list of bad actors from a website and block them all. I suppose, though, that many of these attempts come from pwned machines or through Tor, and blocking an IP may be useless.
It is an attempt to see if this code is running on the server side. PHP and its CMSes have had such problems before, but if the site is written in .NET then everything is fine and you don't have to worry.

C# / .NET GET unexpected behaviour

Doing a simple GET from C#
var webClient = new WebClient();
webClient.Headers.Add("Accept", "*.*");
webClient.Headers.Add("Accept-Encoding", "gzip, deflate");
webClient.Headers.Add("User-Agent", "runscope/0.1");
var response = webClient.DownloadString("http://booking.frederiksberg.dk/NetInterBook/SearchScheme/SimpleSearch.aspx");
I get a response that is different from the one the same request gets in Chrome's Advanced Rest Client / Postman / http://Hurl.it.
I still get a website, but it doesn't contain the form information that I am looking for (the items with IDs similar to drplFacility_item_1).
I've tried using RestSharp and HttpWebResponse as well, with the same results. What am I not doing that these other HTTP clients are? According to Chrome's network tab, they seem to be doing pretty vanilla GETs. Thanks!
Here's the page I get from the webclient: http://pastebin.com/5PjxejKT
It was a Visual Studio GUI bug that was tripping me up. I did use inspectors before posting this question, and I was just really baffled as to why I was getting a different response for the same GET from .NET than everywhere else. Turns out, I wasn't. (Thanks, Wireshark!)
Here's the active bug report: https://connect.microsoft.com/VisualStudio/feedback/details/2016177/text-visualizer-misses-corrupts-text-in-long-strings
Hope this helps anyone who might come across this; it took me a long time to figure this one out...
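If anyone else suspects the visualizer, a small sketch (my own, reusing the URL from the question) that writes the raw response to disk so it can be inspected outside Visual Studio:

using System.IO;
using System.Net;

class DumpResponse
{
    static void Main()
    {
        using (var webClient = new WebClient())
        {
            // Save the raw bytes to a file; the linked bug report says the
            // Visual Studio text visualizer can corrupt long strings.
            byte[] data = webClient.DownloadData("http://booking.frederiksberg.dk/NetInterBook/SearchScheme/SimpleSearch.aspx");
            File.WriteAllBytes("response.html", data);
        }
    }
}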

Screen scrape that bypasses older-browser detection

I am trying in C# to screen-scrape two airline sites so I can compare their fares over many different dates. I managed to do it on qua.com, but when I try it on amadeus.net, the site gives me a response of
older browser not supported
So using the WebBrowser class doesn't work... using HttpWebRequest doesn't work either.
So I want to use WebClient, but because amadeus.net is heavily based on JS or something, I do not know where to post the URL.
Any suggestions?
Edit: WebClient.DownloadString also doesn't work.
Try the Navigate overload that takes additional headers, and pass the user agent there:
string useragent = "Mozilla/5.0 (Windows NT 6.0; rv:39.0) Gecko/20100101 Firefox/39.0";
// the fourth parameter is additionalHeaders, so it needs the header name and a trailing CRLF
webBrowser.Navigate(url, null, null, "User-Agent: " + useragent + "\r\n");
An alternative is to use another web browser control, such as Awesomium.
After looking into passing a fake user agent (from Jodrell) in HttpWebRequest: this works, but I had to deal with cookies, so it can get complicated.
Graffito suggested overriding the user agent within a WebBrowser, but that didn't work, as it gave me lots of JS loading errors; this is because the website itself requires a proper modern browser to work.
I found out that my IE itself was version 9, so I upgraded it to IE 11, then tried Graffito's solution again, but that didn't work.
So in the end I thought I might as well update the WebBrowser control to the correct version by following this article.
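For reference, articles on this topic generally have you set the FEATURE_BROWSER_EMULATION registry key so the WebBrowser control stops defaulting to the old IE7 engine. A hedged sketch of that tweak (value 11001 selects the IE11 engine; the exe name is read at runtime):

using System.IO;
using System.Reflection;
using Microsoft.Win32;

class BrowserEmulation
{
    // Registers this executable under FEATURE_BROWSER_EMULATION so the
    // WinForms WebBrowser control uses the installed IE11 engine.
    static void UseIe11Engine()
    {
        string exeName = Path.GetFileName(Assembly.GetEntryAssembly().Location);
        using (RegistryKey key = Registry.CurrentUser.CreateSubKey(
            @"SOFTWARE\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION"))
        {
            key.SetValue(exeName, 11001, RegistryValueKind.DWord); // 11001 = IE11 edge mode
        }
    }
}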

Getting the source code of a redirected HTTP site via C# WebClient

I have a problem with a certain site. I am provided with a list of product ID numbers (about 2000), and my job is to pull data from the producer's site. I have already tried forming the URLs of the product pages, but there are some unknown variables that I can't fill in to get results. However, there is a search field, so I can use a URL like this: http://www.hansgrohe.de/suche.htm?searchtext=10117000&searchSubmit=Suchen. The problem is that the given page displays info (probably JavaScript) and then redirects straight to the desired page, the one that I need to pull data from.
Is there any way of tracking this redirection?
I would like to post some of my code, but everything I have so far I find unhelpful, because it just downloads the source of the pre-redirect page.
public static string Download(string uri)
{
    // WebClient is IDisposable, so wrap it in a using block
    using (WebClient client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
        return client.DownloadString(uri);
    }
}
Also, the suggested answer is not helpful in this case, because the redirection isn't an HTTP redirect: the page redirects itself a few seconds after loading the http://www.hansgrohe.de/suche.htm?searchtext=10117000&searchSubmit=Suchen URL.
I just found a solution, and since I'm new and have to wait a few hours to answer my own question, it will end up here.
I hope that other users will find it useful:
// the pseudocode from my original post, made concrete for WinForms:
webBrowser1.Navigate(url);
// pump the message loop until the JavaScript redirect has moved us off the original url
while (webBrowser1.Url == null || webBrowser1.Url.AbsoluteUri == url)
{
    Application.DoEvents(); // let the WebBrowser raise its navigation events
}
String desiredUri = webBrowser1.Url.AbsoluteUri; // the redirect target
Thanks for the answers.
Welcome to the wonderful world of page scraping. The short answer is "you can't do that." Not in the general case, anyway, and certainly not with WebClient. The problem appears to be that some Javascript does the redirection. And since all WebClient does is download the page, it's not even going to download the Javascript. Much less parse and execute it.
You might be able to do this by creating a program that uses the WebBrowser class. You can have it load the page. It should do the redirect and then you can inspect the result, which should be the page you were looking for. I haven't actually done this, but it does seem possible.
Your other option is to fire up your Web browser's developer tools (like IE's F12 Developer Tools) and watch what's happening. You can then inspect the Javascript that's being executed as well as the modified DOM, and see where the redirect happens.
Yes, it's tedious work. But once you figure out the redirect for one page, you can probably generate the URL for the other pages you want automatically.
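To make the WebBrowser idea concrete, here is a sketch (mine, not tested against this site; it assumes a running WinForms message loop, and searchUrl is a hypothetical parameter):

using System;
using System.Windows.Forms;

class RedirectTracker
{
    static void Track(string searchUrl)
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        browser.DocumentCompleted += (s, e) =>
        {
            // DocumentCompleted also fires for the initial search page,
            // so only report once the address has actually changed.
            if (e.Url.AbsoluteUri != searchUrl)
                Console.WriteLine("Redirected to: " + e.Url.AbsoluteUri);
        };
        browser.Navigate(searchUrl);
    }
}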

C# Link Analyzer getting Bad Request Errors?

I have a rather simple program which takes in a URL and spits out the first place it redirects to. Anyhow, I've been testing it on some links and noticed it gets 400 errors on some URLs. I tried testing such URLs by pasting them into my browser, and that worked fine.
static string getLoc(string curLoc, out string StatusDescription, int timeoutmillseconds)
{
    HttpWebRequest x = (HttpWebRequest)WebRequest.Create(curLoc);
    x.UserAgent = "Opera/9.52 (Windows NT 6.0; U; en)";
    x.Timeout = timeoutmillseconds;
    x.AllowAutoRedirect = false;
    HttpWebResponse y = null;
    try
    {
        y = (HttpWebResponse)x.GetResponse(); // at this point it throws a 400 bad request exception
        // plausible completion of the snippet, which the question truncates here:
        StatusDescription = y.StatusDescription;
        return y.GetResponseHeader("Location"); // the first redirect target
    }
    catch (WebException ex)
    {
        StatusDescription = ex.Message;
        return null;
    }
}
I think something weird is happening with cookies. It turns out that, due to the way I was testing the link, the necessary cookies for it to work were in my browser but not in the programmatic request.
It's slightly convoluted what happened, but the short answer is that my browser had cookies, the program did not, and maintaining the cookies between redirects did not solve the problem.
The underlying problem is caused by the fact that the link I am testing requires either an extra parameter or a cookie or both. I was trying to avoid both in my tests since the parameter/cookie were for tracking and I didn't want to break tracking.
In short, I know what the problem is but it's not a solvable problem.
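For anyone hitting a variant of this where cookies do matter, a hedged sketch (mine, not the OP's code) of attaching a CookieContainer to HttpWebRequest so cookies persist across requests:

using System.Net;

class CookieExample
{
    static string GetRedirectTarget(string url)
    {
        // A shared container lets cookies set by one response be sent on
        // later requests, mimicking what the browser did in the OP's tests.
        var cookies = new CookieContainer();
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;
        request.AllowAutoRedirect = false;
        using (var response = (HttpWebResponse)request.GetResponse())
        {
            return response.Headers["Location"]; // the first redirect target
        }
    }
}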
