I am trying to screen-scrape two airline sites in C# so I can compare their fares over many different dates. I managed to do it on qua.com, but when I try it on amadeus.net, the site responds with:
older browser not supported
So using the WebBrowser class doesn't work, and using HttpWebRequest doesn't work either.
I would like to use WebClient, but amadeus.net is heavily based on JS or something similar, so I do not know which URL to post to.
Any suggestions?
Edit: WebClient.DownloadString also doesn't work.
Try the Navigate overload that takes additional headers, and pass the user agent as a header line:
string userAgentHeader = "User-Agent: Mozilla/5.0 (Windows NT 6.0; rv:39.0) Gecko/20100101 Firefox/39.0\r\n";
webBrowser.Navigate(url, null, null, userAgentHeader);
An alternative is to use another embedded browser such as Awesomium.
I looked into passing a fake user agent (from Jodrell) in HttpWebRequest. This works, but I had to deal with cookies, which can get complicated.
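For the cookie handling, a minimal sketch, assuming a single shared CookieContainer is enough for the site in question. The class and method names here are hypothetical, and the user-agent string is simply the one from the answer above:

```csharp
using System;
using System.IO;
using System.Net;

public static class ScrapeSession
{
    // Hypothetical helper: one container shared by every request, so cookies
    // set by one response are sent back with the next request.
    public static readonly CookieContainer Cookies = new CookieContainer();

    public const string UserAgent =
        "Mozilla/5.0 (Windows NT 6.0; rv:39.0) Gecko/20100101 Firefox/39.0";

    // Builds a request that presents a browser-like user agent and takes part
    // in the shared cookie session. Nothing is sent at this point.
    public static HttpWebRequest CreateRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.UserAgent = UserAgent;
        request.CookieContainer = Cookies;
        return request;
    }

    // Sends the request and returns the response body as text.
    public static string Download(string url)
    {
        using (var response = (HttpWebResponse)CreateRequest(url).GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling ScrapeSession.Download for the landing page first and then for the search URL would carry any session cookies across. Whether that is sufficient depends on how much of the site's session is built by JavaScript.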
Graffito suggested overriding the user agent within a WebBrowser, but that didn't work: it gave me lots of JS loading errors, because the website itself requires a proper modern browser to work.
I found out that my IE was version 9, so I upgraded it to IE 11 and tried Graffito's solution again, but that didn't work either.
So in the end I thought I might as well update the WebBrowser control to the correct version by following this article
The current situation is that I'm using PhantomJS and Selenium to load web pages, because the host website is behind CloudFlare DDoS protection, so I can't use anything that doesn't have JavaScript built in. This has been working well for a while, but the website has recently started using its own CDN to deliver images, and this causes problems when setting PictureBox.ImageLocation to the src.
If there is any way to get an <img> tag's src and convert it to a Bitmap or Image, so that I can use the image directly from PhantomJS in my PictureBox, that'd be awesome.
Thanks for the help.
For those who are in the same situation as me:
It turns out that it wasn't that easy to set up appropriate caching for PhantomJS and Selenium, so I took an alternative route, which ended up working.
When PhantomJS accesses your website that is locked behind a JS wall, (such as CloudFlare DDOS Protection), it will most likely store a cookie with an auth token of sorts saying that your browser passes the test.
At first, it didn't work for me, because it seems CloudFlare also logs which User Agent has auth'd for that token, and any mismatch will discard the token used.
I managed to solve this using the following piece of code:
private Image GetImage(string ImageLocation)
{
    byte[] data;
    using (CustomWebClient WC = new CustomWebClient())
    {
        // Must match the user agent the PhantomJS driver used when it passed the check.
        WC.Headers.Add(System.Net.HttpRequestHeader.UserAgent, "Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/601.1 (KHTML, like Gecko) CriOS/53.0.2785.109 Mobile/14A403 Safari/601.1.46");
        // Reuse the clearance cookie that CloudFlare issued to PhantomJS.
        WC.Headers.Add(System.Net.HttpRequestHeader.Cookie, "cf_clearance=" + PhantomObject.Manage().Cookies.GetCookieNamed("cf_clearance").Value);
        data = WC.DownloadData(ImageLocation);
    }
    // The stream is left open on purpose: a Bitmap needs its source stream for its whole lifetime.
    return new Bitmap(new System.IO.MemoryStream(data));
}
In this code, PhantomObject is my PhantomJS driver object, and CustomWebClient is just a normal WebClient with a bit of adjusting for the website I was using.
I essentially use the same faked user agent that my PhantomJS driver was using and pass the CloudFlare clearance cookie in the headers. From there my WebClient was able to access the website's data and download the image's bytes, which I then turned into a Bitmap and returned.
When I navigate to some web pages, they tell me to update my browser. Obviously I am telling them that I have an outdated browser, or not telling them about a new one. I bypassed the warning and the page ran normally in the browser anyway.
Is there any way I can say I am an up-to-date browser?
I don't know the proper name for it, so I call it the browser's identity. If I could set it to the latest version of Chrome or Internet Explorer, or the latest version of my own, that would be great.
If not, how can I do it with HttpWebRequest?
Edit:
SUCCESS
By using http://www.whatismybrowser.com/what-is-my-user-agent
with chrome, I took that string and put it into my web browser.
Obligatory success photo : http://postimg.org/image/5qx4s8opd/
Thanks for the help
It is the UserAgent string you should set.
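For the HttpWebRequest part of the question, a small sketch of setting that string. The helper name is hypothetical, and the value to pass is whatever a current browser reports, for example the string copied from the page mentioned in the edit above:

```csharp
using System;
using System.Net;

public static class BrowserIdentity
{
    // Hypothetical helper: creates a request that identifies itself with the
    // given user-agent string. Nothing is sent until GetResponse() is called.
    public static HttpWebRequest CreateRequest(string url, string userAgent)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        // Use the dedicated property. On .NET Framework, adding "User-Agent"
        // through request.Headers is rejected as a restricted header.
        request.UserAgent = userAgent;
        return request;
    }
}
```

This only changes what the request claims to be. A site that relies on script features of a modern browser will still need a real browser engine.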
I have a problem with a certain site. I have been given a list of product ID numbers (about 2000) and my job is to pull data from the producer's site. I already tried forming the URLs of the product pages, but they contain some unknown variables that I can't fill in. However, there is a search field, so I can use a URL like this: http://www.hansgrohe.de/suche.htm?searchtext=10117000&searchSubmit=Suchen - the problem is that this page displays some info (probably via JavaScript) and then redirects straight to the desired page, the one I need to pull data from.
Is there any way of tracking this redirection?
I would like to post some of my code, but everything I have so far is unhelpful, because it just downloads the source of the page before the redirect.
public static string Download(string uri)
{
    // Dispose the client once the download has finished.
    using (WebClient client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
        // Returns the HTML as served; scripts in the page are not executed.
        return client.DownloadString(uri);
    }
}
Also, the suggested answer is not helpful in this case, because the redirection doesn't come with the HTTP response: the page is redirected a few seconds after loading the http://www.hansgrohe.de/suche.htm?searchtext=10117000&searchSubmit=Suchen URL.
I just found a solution, and since I'm new and have to wait a few hours to answer my own question, it will end up here.
I hope that other users will find it useful:
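// WinForms sketch: webBrowser1 is a WebBrowser control on the form.
string startUrl = "http://www.hansgrohe.de/suche.htm?searchtext=10117000&searchSubmit=Suchen";
webBrowser1.Navigate(startUrl);
// Keep pumping messages while the control is still on the search page.
while (webBrowser1.Url == null || webBrowser1.Url.AbsoluteUri == startUrl)
{
    Application.DoEvents(); // lets the page load and run its redirect script
    System.Threading.Thread.Sleep(50);
}
string desiredUri = webBrowser1.Url.AbsoluteUri;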
Thanks for answers.
Welcome to the wonderful world of page scraping. The short answer is "you can't do that." Not in the general case, anyway, and certainly not with WebClient. The problem appears to be that some Javascript does the redirection. And since all WebClient does is download the page, it's not even going to download the Javascript. Much less parse and execute it.
You might be able to do this by creating a program that uses the WebBrowser class. You can have it load the page. It should do the redirect and then you can inspect the result, which should be the page you were looking for. I haven't actually done this, but it does seem possible.
Your other option is to fire up your Web browser's developer tools (like IE's F12 Developer Tools) and watch what's happening. You can then inspect the Javascript that's being executed as well as the modified DOM, and see where the redirect happens.
Yes, it's tedious work. But once you figure out the redirect for one page, you can probably generate the URL for the other pages you want automatically.
I use Process.Start("firefox.exe", "http://localhost/page.aspx");
How can I know whether the page fails or not?
OR
How can I know via HttpWebRequest and HttpWebResponse whether the page fails or not?
When I use
HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("somepage.aspx");
HttpWebResponse loWebResponse = (HttpWebResponse)myReq.GetResponse();
Console.Write("{0},{1}",loWebResponse.StatusCode, loWebResponse.StatusDescription);
how can I return the error details?
I don't want additional plugins or frameworks. I want to solve this problem using only .NET.
Any ideas, please?
Use WatiN to automate Firefox instead of Process.Start. It's a browser automation framework that will let you monitor what is happening properly.
http://watin.sourceforge.net/
edit: see also Google Webdriver http://google-opensource.blogspot.com/2009/05/introducing-webdriver.html
If you are spawning a child-process, it is quite hard and you'd probably need to use each browser's specific API (it won't be the same between FF and IE, for example).
It doesn't help that in many cases the exe detects an existing instance and forwards the request there (so you can't trust the exit-code, since the page hasn't even been requested in the right exe yet).
Personally, I try to avoid assuming any particular browser for this scenario; just launch the url:
Process.Start("http://somesite.com");
This will use the user's default browser. You have to hope it appears though - you can't (reliably and robustly) check that externally without lots of work.
One other option is to read the data yourself (WebClient.Download*) - but this may have issues with complex cookies, login, user-agent awareness, etc.
Use the HttpWebRequest class or the WebClient class to check this. I don't think Process.Start will return anything if the URL does not exist.
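A minimal sketch of that check using only .NET, assuming that "fails" means either an HTTP error status or no response at all. The class and method names are hypothetical. Note that the URL has to be absolute (for example http://localhost/page.aspx), since WebRequest.Create cannot parse a bare "somepage.aspx":

```csharp
using System;
using System.IO;
using System.Net;

public static class PageCheck
{
    // Returns true when the server answered without an HTTP error status.
    // 'details' carries the status line, or the error text and body on failure.
    public static bool TryLoad(string url, out string details)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                details = (int)response.StatusCode + " " + response.StatusDescription;
                return true;
            }
        }
        catch (WebException ex)
        {
            // For 4xx/5xx answers the server's reply is attached to the exception.
            var error = ex.Response as HttpWebResponse;
            if (error == null)
            {
                // No HTTP reply at all (DNS failure, refused connection, timeout).
                details = ex.Status + ": " + ex.Message;
                return false;
            }
            using (error)
            using (var reader = new StreamReader(error.GetResponseStream()))
            {
                // The body of an ASP.NET error page usually holds the details.
                details = (int)error.StatusCode + " " + error.StatusDescription
                          + Environment.NewLine + reader.ReadToEnd();
                return false;
            }
        }
    }
}
```

The key point is that GetResponse throws a WebException for error statuses, so the error details come from the exception's Response rather than from the normal return value.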
Don't start the page in this form. Instead, create a local http://localhost:<port>/wrapper.html which loads http://localhost/page.aspx and then either http://localhost:<port>/pass.html or http://localhost:<port>/fail.html. localhost:<port> is a trivial HTTP server interface implemented by your app.
The idea is that Javascript gives you an API inside the browser, which is far more standard than the APIs on the outside of browsers. Since the Javascript on wrapper.html comes from the same server and even port as the subsequent resources, this should satisfy the same-origin policies in current browsers.
Use HttpWebRequest to download web pages without key sensitive issues
[update: I don't know why, but both examples below now work fine! Originally I was also seeing a 403 on the page2 example. Maybe it was a server issue?]
First, WebClient is easier. Actually, I've seen this before. It turned out to be case sensitivity in the url when accessing wikipedia; try ensuring that you have used the same case in your request to wikipedia.
[updated] As Bruno Conde and gimel observe, using %27 should help make it consistent (the intermittent behaviour suggests that some Wikipedia servers may be configured differently from others).
I've just checked, and in this case the case issue doesn't seem to be the problem... however, if it worked (it doesn't), this would be the easiest way to request the page:
using (WebClient wc = new WebClient())
{
    string page1 = wc.DownloadString("http://en.wikipedia.org/wiki/Algeria");
    string page2 = wc.DownloadString("http://en.wikipedia.org/wiki/%27Abadilah");
}
I'm afraid I can't think what to do about the leading apostrophe that is breaking things...
I also got strange results ... First, the
http://en.wikipedia.org/wiki/'Abadilah
didn't work and after some failed tries it started working.
The second url,
http://en.wikipedia.org/wiki/'t_Zand_(Alphen-Chaam)
always failed for me...
The apostrophe seems to be responsible for these problems. If you replace it with
%27
all urls work fine.
Try escaping the special characters using Percent Encoding (paragraph 2.1). For example, a single quote is represented by %27 in the URL (IRI).
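As a small illustration of that replacement, here is a sketch with a hypothetical helper that only handles the apostrophe, since that is the one character identified as the problem in this thread:

```csharp
using System;

public static class WikiUrl
{
    // Hypothetical helper: builds an article URL and swaps each apostrophe
    // for its percent-encoded form, %27.
    public static string ForTitle(string title)
    {
        return "http://en.wikipedia.org/wiki/" + title.Replace("'", "%27");
    }
}
```

WikiUrl.ForTitle("'Abadilah") yields http://en.wikipedia.org/wiki/%27Abadilah, which is the form that worked above. For arbitrary titles a general-purpose encoder would be the more usual choice. The explicit replacement is used here only to keep the behaviour obvious.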
I'm sure the OP has this sorted by now but I've just run across the same kind of problem - intermittent 403's when downloading from wikipedia via a web client. Setting a user agent header sorts it out:
client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");