The current situation is that I'm using PhantomJS and Selenium to load web pages, because the host website sits behind Cloudflare DDoS protection, so I can't use anything that doesn't have JavaScript support built in. This has been working well for a while, but the website has recently started using its own CDN to deliver these images, and that causes problems when setting PictureBox.ImageLocation to the src.
If there is any way to get an <img> tag's src and convert it to a Bitmap or Image, so that I can use the image directly from PhantomJS in my PictureBox, that'd be awesome.
Thanks for the help.
For those who are in the same situation as me:
It turns out that it wasn't easy to get PhantomJS and Selenium to cache things appropriately, so I turned to an alternative route, which ended up working.
When PhantomJS accesses a website that is locked behind a JS wall (such as CloudFlare DDoS protection), it will most likely store a cookie with an auth token of sorts, saying that your browser passed the test.
At first it didn't work for me, because it seems CloudFlare also logs which User-Agent authenticated for that token, and any mismatch causes the token to be discarded.
I managed to solve this using the following piece of code:
private Image GetImage(string imageLocation)
{
    byte[] data;
    using (CustomWebClient wc = new CustomWebClient())
    {
        wc.Headers.Add(System.Net.HttpRequestHeader.UserAgent, "Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/601.1 (KHTML, like Gecko) CriOS/53.0.2785.109 Mobile/14A403 Safari/601.1.46");
        wc.Headers.Add(System.Net.HttpRequestHeader.Cookie, "cf_clearance=" + PhantomObject.Manage().Cookies.GetCookieNamed("cf_clearance").Value);
        data = wc.DownloadData(imageLocation);
    }
    // Note: the MemoryStream must stay alive for as long as the Bitmap is in use.
    return new Bitmap(new System.IO.MemoryStream(data));
}
In this code, PhantomObject is my PhantomJS driver object, and CustomWebClient is just a normal WebClient with a bit of adjusting for the website I was using.
I essentially used the same faked user agent that my PhantomJS driver was using, and passed the CloudFlare clearance cookie over in the headers. From there my WebClient was able to access the website, download the image's data, and I then turned that into a Bitmap and returned it.
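The answer above never shows CustomWebClient itself. As a rough sketch (the class name comes from the answer, but the body here is my assumption, since the original adjustments weren't posted), a minimal WebClient subclass that keeps cookies across requests might look like this:

```csharp
// Hypothetical sketch of CustomWebClient: a WebClient that carries a
// CookieContainer so cookies persist across requests. The real class in
// the answer had site-specific tweaks that were not posted.
using System;
using System.Net;

public class CustomWebClient : WebClient
{
    public CookieContainer Cookies { get; } = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request is HttpWebRequest http)
        {
            http.CookieContainer = Cookies; // attach the shared cookie jar
        }
        return request;
    }
}
```

With a subclass like this you could also set the cf_clearance cookie on the CookieContainer instead of writing the Cookie header by hand.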
I use the GMap.NET library in my WinForms application and connect through a proxy.
Here is the code:
mapCtrl.MapProvider = GMap.NET.MapProviders.GMapProviders.GoogleMap;
GMaps.Instance.Mode = AccessMode.ServerOnly;
GMap.NET.MapProviders.GMapProvider.WebProxy =
    System.Net.WebRequest.GetSystemWebProxy();
GMap.NET.MapProviders.GMapProvider.WebProxy.Credentials =
    System.Net.CredentialCache.DefaultCredentials;
The problem is that my company's information security department is blocking these connections, and as a result the map tiles don't load. They have asked me for the API URL so they can add it to the whitelist.
Does anybody know which URLs GMap.NET uses for GMapProviders.GoogleMap?
From Martin Costello's comment:
Use a tool like Wireshark or Fiddler on your local machine and see what URLs the application tries to access over the network.
Using Fiddler helped me.
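One way to make the tile URLs show up in Fiddler, sketched under the assumption that Fiddler is running with its default listener on 127.0.0.1:8888, is to temporarily point GMap.NET at Fiddler as its proxy:

```csharp
// Sketch: route GMap.NET's requests through Fiddler (default: 127.0.0.1:8888)
// so every tile URL appears in Fiddler's session list. Remove this once
// you have captured the URLs you need for the whitelist.
GMap.NET.MapProviders.GMapProvider.WebProxy =
    new System.Net.WebProxy("127.0.0.1", 8888);
```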
I want to extract a number from this site: https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml (the number highlighted in yellow in the screenshot below). I wrote this code in C#:
var url = @"https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml";
HtmlWeb web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36";
var htmlDoc = web.Load(url);
var node = htmlDoc.DocumentNode.SelectSingleNode("//*[@id='Listed_IncomeStatement_tableResult']/tbody/tr[1]/td[2]");
string strSo = node.OuterHtml;
Console.WriteLine(strSo);
but in strSo I cannot find the yellow number (19,749,872).
Could you show me a way to extract that number from the website?
Sorry, my English is not good.
You are running into this problem because the website loads the data into the table via an AJAX request after the page is loaded, but HtmlAgilityPack can only download what the server directly sends you.
You can verify this by looking at the source it downloads via HtmlWeb: in the DocumentNode HTML, the table tag with id "Listed_IncomeStatement_tableResult" has no data in its tbody.
To get around this problem, you should use Selenium WebDriver.
It lets you drive a real browser (Firefox or Chrome, for example) that executes the complete page, including all the JavaScript inside of it, and then gives you back the full source of the page after it has run.
Here you can find the driver to use Chrome: Chrome Driver
After you have imported all the libraries, you only have to execute the following code:
// Make sure to set the path to the folder where you extracted chromedriver.exe:
IWebDriver driver = new ChromeDriver(@"Path\To\Chromedriver");
driver.Navigate().GoToUrl("https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml");
After that, you will be able to access the webpage directly from the driver object, like:
IWebElement myField = driver.FindElement(By.Id("tools"));
The only problem with ChromeDriver is that it will open up a browser window to render everything. To avoid this, you can use another driver such as PhantomJS, which does the same as Chrome but does not open any window.
For more examples of how to use Selenium WebDriver with C#, I recommend you take a look at:
Selenium C# tutorial
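Putting the pieces together for this particular page, a headless version might look like the sketch below. The driver path, the 10-second timeout, and the assumption that the AJAX-filled cell appears once the XPath matches are all mine, not part of the original answer:

```csharp
// Sketch: load the page headlessly with PhantomJS, wait until the
// AJAX request has filled the target cell, then read its text.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;
using OpenQA.Selenium.Support.UI;

class Program
{
    static void Main()
    {
        // Path is an assumption: point it at the folder containing phantomjs.exe.
        IWebDriver driver = new PhantomJSDriver(@"Path\To\PhantomJS");
        try
        {
            driver.Navigate().GoToUrl("https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml");

            // Wait until the cell exists; WebDriverWait retries while the
            // element is not yet found, up to the timeout.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            IWebElement cell = wait.Until(d => d.FindElement(
                By.XPath("//*[@id='Listed_IncomeStatement_tableResult']/tbody/tr[1]/td[2]")));

            Console.WriteLine(cell.Text); // the yellow-highlighted number
        }
        finally
        {
            driver.Quit();
        }
    }
}
```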
I am trying, in C#, to screen-scrape two airline sites so I can compare their fares over many different dates. I managed to do it on qua.com, but when I try it on amadeus.net, the site gives me a response of
older browser not supported
So using the WebBrowser class doesn't work, and using HttpWebRequest doesn't work either.
So I want to use WebClient, but because amadeus.net is heavily based on JS or something, I do not know where to post the URL.
Any suggestion?
Edit: WebClient.DownloadString also doesn't work.
Try to use the Navigate overload with the user agent:
string useragent = "Mozilla/5.0 (Windows NT 6.0; rv:39.0) Gecko/20100101 Firefox/39.0";
webBrowser.Navigate(url, null, null, useragent);
An alternative is to use another web browser control, such as Awesomium.
After looking into passing a fake user agent (from Jodrell) in HttpWebRequest: this works, but I had to deal with cookies, so that can get complicated.
Graffito suggested overriding the user agent within a WebBrowser, but that didn't work, as it gave me lots of JS loading errors; the website itself requires a proper modern browser for it to work.
I found out that my IE was version 9, so I upgraded it to IE 11 and then tried Graffito's solution again, but that didn't work either.
So in the end I thought I might as well update the WebBrowser control to the correct version by following this article.
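For reference, the usual technique such articles describe is a per-application registry entry under the documented FEATURE_BROWSER_EMULATION feature control, where the DWORD value 11001 forces IE11 edge mode for this executable. A sketch:

```csharp
// Sketch: opt this executable into IE11 emulation for the WebBrowser
// control via the documented FEATURE_BROWSER_EMULATION registry key.
// 11001 = always use IE11 edge mode, regardless of !DOCTYPE.
using System.Diagnostics;
using Microsoft.Win32;

class BrowserEmulation
{
    public static void EnableIE11()
    {
        string exeName = Process.GetCurrentProcess().ProcessName + ".exe";
        using (RegistryKey key = Registry.CurrentUser.CreateSubKey(
            @"Software\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION"))
        {
            key.SetValue(exeName, 11001, RegistryValueKind.DWord);
        }
    }
}
```

Call this once at startup, before the WebBrowser control is created, so the setting takes effect for the process.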
I have a problem with a certain site. I am provided with a list of product ID numbers (about 2000), and my job is to pull data from the producer's site. I already tried forming the URLs of the product pages, but there are some unknown variables that I can't fill in to get results. However, there is a search field, so I can use a URL like this: http://www.hansgrohe.de/suche.htm?searchtext=10117000&searchSubmit=Suchen. The problem is that this page displays some info (probably JavaScript) and then redirects straight to the desired page, the one that I need to pull data from.
Is there any way of tracking this redirection?
I would like to post some of my code, but everything I have so far is unhelpful, because it just downloads the source of the pre-redirect page.
public static string Download(string uri)
{
    WebClient client = new WebClient();
    client.Encoding = Encoding.UTF8;
    client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
    string s = client.DownloadString(uri);
    return s;
}
Also, the suggested answer is not helpful in this case, because the redirection doesn't come with the HTTP response: the page is redirected after a few seconds of loading http://www.hansgrohe.de/suche.htm?searchtext=10117000&searchSubmit=Suchen.
I just found a solution, and since I'm new and have to wait a few hours to answer my own question, it will end up here.
I hope other users will find it useful:
// Sketch (WinForms): navigate, wait for the JS redirect, then read the final URL.
webBrowser1.Navigate(url);
while (webBrowser1.Url == null || webBrowser1.Url.AbsoluteUri == url)
{
    Application.DoEvents(); // let the WebBrowser process the redirect
}
string desiredUri = webBrowser1.Url.AbsoluteUri;
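A variant of the same idea without a polling loop, sketched for a WinForms WebBrowser (the event-based form is my own restatement, not part of the original answer):

```csharp
// Sketch: capture the post-redirect URL from the DocumentCompleted event.
// DocumentCompleted fires once per completed navigation, so the handler
// ignores the initial search URL and reports only the redirect target.
string startUrl = "http://www.hansgrohe.de/suche.htm?searchtext=10117000&searchSubmit=Suchen";
webBrowser1.DocumentCompleted += (sender, e) =>
{
    if (e.Url.AbsoluteUri != startUrl)
    {
        Console.WriteLine("Redirected to: " + e.Url.AbsoluteUri);
    }
};
webBrowser1.Navigate(startUrl);
```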
Thanks for answers.
Welcome to the wonderful world of page scraping. The short answer is "you can't do that." Not in the general case, anyway, and certainly not with WebClient. The problem appears to be that some Javascript does the redirection. And since all WebClient does is download the page, it's not even going to download the Javascript. Much less parse and execute it.
You might be able to do this by creating a program that uses the WebBrowser class. You can have it load the page. It should do the redirect and then you can inspect the result, which should be the page you were looking for. I haven't actually done this, but it does seem possible.
Your other option is to fire up your Web browser's developer tools (like IE's F12 Developer Tools) and watch what's happening. You can then inspect the Javascript that's being executed as well as the modified DOM, and see where the redirect happens.
Yes, it's tedious work. But once you figure out the redirect for one page, you can probably generate the URL for the other pages you want automatically.
I'm using some basic code to display a mobile website in my application using a WebBrowser control.
For some reason, if I use the standard browser the page is sized correctly to 480 x 800, but when I use the WebBrowser in my application the page is way off, more like 960 x 800.
Is there a way to force the size the page is displayed in the web browser?
The funny thing is that it was working a few months ago, but has suddenly gone haywire.
The code is:
string site1 = "http://m.domain.com";
webBrowser1.Navigate(new Uri(site1, UriKind.Absolute));
webBrowser1.Navigated += new EventHandler<System.Windows.Navigation.NavigationEventArgs>(webBrowser1_Navigated);
I was thinking I could force the user agent by using the code below, but I am getting the error "No overload for method 'Navigate' takes 4 arguments":
webBrowser1.Navigate("http://localhost/run.php", null, null, "User-Agent: Windows Phone 7");
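For what it's worth, if I recall the Windows Phone API correctly, its WebBrowser control's Navigate overload takes three arguments, (Uri, byte[] postData, string headers), not four, so the user agent would go inside the headers string. A sketch (the user-agent value here is just an example):

```csharp
// Sketch for the Windows Phone WebBrowser control: Navigate takes
// (Uri, byte[] postData, string headers), so the user agent is passed
// inside the headers string rather than as a separate 4th argument.
webBrowser1.Navigate(
    new Uri("http://m.domain.com", UriKind.Absolute),
    null,
    "User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0)\r\n");
```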
Page using the app:
Page using the standard IE9 browser outside of the application, but on the handset.
One place to start would be to sniff the request/response. There's clearly different HTML coming back from the server for the two requests. For the in-app browser, it clearly says it's having trouble identifying the device. If you can figure out what is different between the two requests, you might be able to force the in-app browser to make the request more like the out-of-app browser.
If you're using the simulator, Fiddler is a fantastic tool for that sort of thing. The first place I'd look is at the User-Agent header, which most sites use to figure out what type of browser is requesting the page.