I'm using PhantomJSDriver 1.8.1 for .NET (C#)
(http://www.nuget.org/packages/phantomjs.exe/) and wonder how to set a Firefox user agent before loading the web content.
Although Cybermax's answer is somewhat correct, it doesn't apply to what you are actually using: C#. To specify a user agent for the PhantomJSDriver in C#, you will need to pass it as an "additional capability":
var options = new PhantomJSOptions();
options.AddAdditionalCapability("phantomjs.page.settings.userAgent", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0");
These options should be passed into the constructor used to create the driver:
var driver = new PhantomJSDriver(options);
To verify this has been set correctly, you can check against a website that reports your browser's user agent, such as WhatIsMyUserAgent.com, or look closely at the PhantomJS console window: it prints a "useragent" value, which should match what you specified above.
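Alternatively, since the driver implements IJavaScriptExecutor, you can ask the rendered page itself which user agent it saw. A minimal sketch, assuming the same Selenium PhantomJS bindings as above (the target URL is just an example):

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

var options = new PhantomJSOptions();
options.AddAdditionalCapability(
    "phantomjs.page.settings.userAgent",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0");

using (var driver = new PhantomJSDriver(options))
{
    driver.Navigate().GoToUrl("http://stackoverflow.com/");

    // Ask the page which user agent it received.
    var reportedAgent = (string)((IJavaScriptExecutor)driver)
        .ExecuteScript("return navigator.userAgent;");
    Console.WriteLine(reportedAgent); // should print the Firefox string set above
}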
In your script, you have to define the property page.settings.userAgent before the first call to page.open.
var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'; //firefox 25
page.open('http://stackoverflow.com/', function (status) {
// do something
});
Note: the latest version of PhantomJS is 1.9.2. Another package is available here.
Related
I'm using the following code to download the given web page.
using (WebClient client = new WebClient())
{
    client.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

    using (Stream data = client.OpenRead("https://www.yellowpages.com/south-jordan-ut/plumbers?page=5"))
    using (StreamReader reader = new StreamReader(data))
    {
        string html = reader.ReadToEnd();
    }
}
It works but html appears to contain only a small portion of the final page. I understand that this site has dynamic content, but when I view the source code in Chrome (by selecting the View page source command), it appears to download everything.
Unless Chrome actually runs scripts when you run View page source, it seems to be using some other trick to get all the HTML. Does anyone know what that trick might be?
If you read the HTML that the WebClient is returning, you can see this text:
"We can provide you with the best experience on Yellowpages.com, if you upgrade to the latest version of your browser"
If you change your user-agent to something that Chrome would send, you get the results as expected:
client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36");
There's probably an AJAX call or something similar that loads the page data. It's a modern software paradigm, whereas before, the page would already contain all of its data. What everyone else is saying is that if there IS JavaScript loading the content, then the WebClient would not be able to load it. This is why you see it in your browser and not in the WebClient.
The solution is to use another tool like Selenium to download the page into a rendering engine and then scrape what you need.
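A minimal sketch of that approach using Selenium with ChromeDriver (assumes the Selenium.WebDriver and ChromeDriver NuGet packages; the URL is the one from the question):

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

using (IWebDriver driver = new ChromeDriver())
{
    driver.Navigate().GoToUrl("https://www.yellowpages.com/south-jordan-ut/plumbers?page=5");

    // PageSource returns the DOM after Chrome has executed the page's JavaScript,
    // so dynamically loaded content is included.
    string html = driver.PageSource;
}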
I am trying to get the HTML source of a YouTube video page using the cURL command line, but I need it to be without HTTPS/SSL.
My problem is that I must use a compiled version of cURL that includes SSL/SSH.
I am using the following command:
curl --user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36" -L -x http://my.foo.proxy:8080 http://youtube.com/watch?v=youtubevideo > html.html
This works, but a specific part of the HTML source is in HTTPS (look for a really long script string inside that file; some of the links there start with httpS).
curl --proto =http --proto-redir =http --user-agent "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36" -L -x http://my.foo.proxy:8080 http://youtube.com/watch?v=youtubevideo > html.html
This command causes an error:
protocol https not supported or disabled in libcurl.
which is really weird, because the cURL version I am using does have SSL, and I don't even want HTTPS (see the --proto and --proto-redir args).
As a test, I also tried using the .NET WebClient class like:
public static void DownloadString(string address)
{
    WebClient client = new WebClient();
    string reply = client.DownloadString(address);
    Console.WriteLine(reply);
}
and in this case I get an HTML source file without HTTPS.
My question is: how do I get the HTML source of a YouTube video using cURL without HTTPS links inside it, like when I use .NET/WebClient?
Using a user agent without Firefox in it fixes this issue when used from the console:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101
When used with a binding, set SSL_VERIFYPEER to false and SSL_VERIFYHOST to 0. This allows a man-in-the-middle attack, but if that is the only option...
In addition, HTTPGET and FOLLOWLOCATION should both be set to true.
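If you end up doing this from plain .NET instead of a curl binding, here is a rough HttpWebRequest equivalent of those options (user agent, GET, follow redirects, skip certificate verification); this is a sketch of the intent, not the binding's API:

using System.IO;
using System.Net;

// WARNING: disabling certificate validation is the equivalent of
// SSL_VERIFYPEER = false / SSL_VERIFYHOST = 0 and allows man-in-the-middle attacks.
ServicePointManager.ServerCertificateValidationCallback = (sender, cert, chain, errors) => true;

var request = (HttpWebRequest)WebRequest.Create("http://youtube.com/watch?v=youtubevideo");
request.Method = "GET";           // HTTPGET = true
request.AllowAutoRedirect = true; // FOLLOWLOCATION = true
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101";

using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string html = reader.ReadToEnd();
}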
I created an empty C# web site with just one page that outputs Request.Browser.Version and Request.UserAgent, then hit it with different Chrome versions using the "User-Agent Switcher" Chrome extension.
From time to time, even though Request.UserAgent is correct, Request.Browser.Version seems to return the wrong value:
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.16 Safari/537.36" Returned Request.Browser.Version:39
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2272.16 Safari/537.36" Returned Request.Browser.Version:41
So yes, .NET 4.5 caches the user agent by its first 64 characters, and that cut-off falls just before the version number in Chrome's user agent string. So the next user with the same browser but a different version will get the wrong browser version, and so on.
To solve it, just change browserCaps userAgentCacheKeyLength="...", as can be seen here:
.Net 4.0 website cannot identify some AppleWebKit based browsers
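For reference, the change goes in web.config; a minimal sketch (256 is just an example length that is long enough to reach past Chrome's version number):

<system.web>
  <!-- The default cache key length is 64 characters; raise it so the version number is included. -->
  <browserCaps userAgentCacheKeyLength="256" />
</system.web>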
How is this stupid Microsoft bug not making headlines?
I have some code (in a Winform app) that reads this URL using HttpWebRequest.GetResponse().
For some reason, it recently started returning 500 Internal Server Error when requested from my app.
(The response contains some HTML for the navigation, but doesn't have the main content I need.)
On Firefox/Chrome/IE, it still returns 200 OK.
The problem is I don't have control over their code, I don't know what it does on the backend that causes it to break when requested from my app.
Is there a way I can "pretend" to make the request from, say, Google Chrome? (just to avoid the error)
Set the HttpWebRequest.UserAgent property to the value of a real browser's user agent.
HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("http://example.com");
webRequest.UserAgent = @"Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36";
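A minimal sketch of sending the request and reading the body once the user agent is set (the URL is just the placeholder from above):

using System.IO;
using System.Net;

HttpWebRequest webRequest = (HttpWebRequest)WebRequest.Create("http://example.com");
webRequest.UserAgent = @"Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36";

using (var response = (HttpWebResponse)webRequest.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string body = reader.ReadToEnd();
}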
I've been playing around with Selenium and PhantomJS in C# but I want to be able to fake my User Agent to be this:
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0
Instead of:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.34 (KHTML, like Gecko) PhantomJS/1.9.1 Safari/534.34
Is it possible to modify the HTTP headers of PhantomJS with Selenium to achieve this? If so, how?
Thanks in advance.
I found the answer:
PhantomJSOptions options = new PhantomJSOptions();
options.AddAdditionalCapability("phantomjs.page.settings.userAgent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0");
IWebDriver driver = new PhantomJSDriver(options);
Thanks.