I need to get a list of all the pages on a website (all links). I have to use PhantomJS, but I have never used it before. Can anybody explain how to use it? How do I parse the HTML with PhantomJS to get all the links?
PhantomJS is a headless WebKit browser scriptable with a JavaScript API. It's distributed as a single executable.
Download PhantomJS from the official website
There are official releases for Windows, Mac, and Linux, but you can also build your own version if you want.
Create a script
PhantomJS does nothing by itself; it's just an executable. You have to script your actions yourself, in JavaScript or CoffeeScript.
Run the script
From the command prompt, you just have to run:
> phantomjs yourscript.js
Sometimes you have to create a wrapper for PhantomJS. In WPF especially, use the Process/ProcessStartInfo classes to manage the script execution.
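For example, a minimal C# wrapper might look like the sketch below; the executable location, script path, and output handling are illustrative assumptions:

// Sketch of a C# wrapper around phantomjs.exe.
// Assumes phantomjs.exe is on the PATH; adjust paths as needed.
using System.Diagnostics;

public static string RunPhantomScript(string scriptPath, string url)
{
    var startInfo = new ProcessStartInfo
    {
        FileName = "phantomjs.exe",
        Arguments = "\"" + scriptPath + "\" \"" + url + "\"",
        UseShellExecute = false,        // required for stream redirection
        RedirectStandardOutput = true,
        CreateNoWindow = true
    };

    using (var process = Process.Start(startInfo))
    {
        // Everything the script prints with console.log() ends up here.
        string output = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        return output;
    }
}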
How to write a script?
If you're familiar with JavaScript, and especially with Node.js development, the learning curve is gentle. The quick-start guide is a valuable starting point, and don't hesitate to practice with the available examples. That's the most difficult part, but after a few scripts it gets easier.
To answer your initial question, here is a possible script:
var page = require('webpage').create();
var system = require('system');

if (system.args.length !== 2) {
    console.log('Usage: so20189669.js <URL>');
    phantom.exit(1);
} else {
    var url = system.args[1];
    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Unable to load the address!');
            phantom.exit();
        } else {
            var links = page.evaluate(function () {
                return [].map.call(document.querySelectorAll('a'), function (link) {
                    return link.getAttribute('href');
                });
            });
            console.log(JSON.stringify(links));
            phantom.exit();
        }
    });
}
In the Command Prompt:
>phantomjs.exe so20189669.js http://stackoverflow.com/questions/20189669/how-to-get-uri-with-phantomjs
There is no magic answer, and you will have to adapt it to your needs!
Related
I am using C# with Selenium for QA automation, and I am having issues with downloading an .xml file, because a prompt is always showing up asking if I want to keep the file. It also opens a second tab to execute the download, closing it after the prompt shows up.
(screenshot: the "keep file" download prompt)
Using Chrome I do not see this behavior.
I searched all over and could not find an EdgeOptions() and/or AddArguments() setting capable of taking care of this issue.
Any ideas?
You need to use JavaScript to interact with elements in another browser. I have had that kind of experience, and I used an if/else statement in my method to handle the problem. Just look through the Selenium documentation, JavaScript-with-Selenium examples, and so on.
Just add this to your OneTimeSetup method. Make sure to run Visual Studio as administrator. This works on Edge 105 and later:
// Requires: using Microsoft.Win32;
public void SetEdgeXmlDownloadPolicy()
{
    var keyName = "Software\\Policies\\Microsoft\\Edge\\";
    var valueName = "ExemptFileTypeDownloadWarnings";
    var valueData = @"{""domains"": [""*""], ""file_extension"": ""xml""}";

    var currentUser = RegistryKey.OpenBaseKey(RegistryHive.CurrentUser, RegistryView.Registry64);
    var currentKey = currentUser.OpenSubKey(keyName, true);
    if (currentKey == null)
        currentKey = currentUser.CreateSubKey(keyName);
    if (currentKey.GetValue(valueName) == null)
        currentKey.SetValue(valueName, valueData);
}
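For reference, a call site might look like this hypothetical NUnit fixture (the class name is a placeholder):

// Hypothetical NUnit fixture showing where to call the policy setup.
using NUnit.Framework;

[TestFixture]
public class EdgeDownloadTests
{
    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        // Write the registry policy before any Edge session is created.
        SetEdgeXmlDownloadPolicy();
    }
}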
So the title says it all: I would like C# code (so please, PLEASE make sure it isn't Visual Basic code). And that is all I want to ask. I have tried the web browser control built into the .NET Framework, but it looks like some old version of IE (if I'm right about that). And if you answered, well, thanks I guess! I need this for a small project where a bot would just log on to a website (it's a base for future projects).
By default it's IE7. You can bang a registry entry in to make it use a later version:
public static void EnsureBrowserEmulationEnabled(string exename = "YourAppName.exe", bool uninstall = false)
{
    try
    {
        using (
            var rk = Registry.CurrentUser.OpenSubKey(
                @"SOFTWARE\Microsoft\Internet Explorer\Main\FeatureControl\FEATURE_BROWSER_EMULATION", true)
        )
        {
            if (!uninstall)
            {
                dynamic value = rk.GetValue(exename);
                if (value == null)
                    // 11001 = IE11 edge mode; other documented values include
                    // 10001 (IE10), 9999 (IE9), 8888 (IE8) and 7000 (IE7).
                    rk.SetValue(exename, (uint)11001, RegistryValueKind.DWord);
            }
            else
                rk.DeleteValue(exename);
        }
    }
    catch
    {
        // Swallow registry access failures (e.g. insufficient permissions).
    }
}
Code courtesy of this blog
The values you can use in place of 11001 can be found on MSDN.
Alternatively: can you do what you want by using WebClient/HttpWebRequest rather than poking at a web browser control to navigate around? Or can you find some web service/API version of the site that will respond with JSON rather than trying to manipulate HTML?
I was mildly curious why you'd care what a page looks like if it's a bot that is using it, but perhaps you're hitting a "your IE is too old" message from the server.
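To illustrate the WebClient route, a minimal sketch might look like this (the URL and headers are placeholders, not anything specific to your site):

// Sketch: fetch a page without any browser control.
// The URL is a placeholder; add cookies/credentials as the site requires.
using System;
using System.Net;

class Fetcher
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0";
            string html = client.DownloadString("http://example.com/login");
            Console.WriteLine(html.Length + " bytes downloaded");
        }
    }
}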
I'm using Selenium in C# with the PhantomJS driver.
I need to click specific coordinates on a website; that works using JavaScript (I'm using the ExecutePhantomJS(string script) function of the Selenium PhantomJS driver). I also need to capture the network traffic. I used BrowserMob earlier to do that, but for now I can't use it because I also need to use another proxy. So I've solved it like this until now:
//Hide CMD of PhantomJS.exe
var driverService = PhantomJSDriverService.CreateDefaultService();
driverService.HideCommandPromptWindow = true;
//Initialize Driver and execute Network Script to capture traffic
driver = new PhantomJSDriver(driverService);
driver.ExecutePhantomJS(networkScript);
//Call URL
driver.Navigate().GoToUrl(url);
This is the networkScript:
string networkScript = "var page = this; page.onResourceRequested = function(req) { console.log('requested: ' + JSON.stringify(req, undefined, 4)); }; page.onResourceReceived = function(res) { console.log('received: ' + JSON.stringify(res, undefined, 4)); };";
The good thing:
The URL is loaded and all network traffic is logged to the console of PhantomJS.exe.
But I don't know how I can get these console logs into my C# code (I need to extract specific things, like URLs, from the network log).
I've been reading the whole afternoon but couldn't find a solution so far. Some of the things I already tried:
1) Tried to use PhantomJSOptions, where you can set LoggingPreferences, and later I called driver.Manage().Logs.GetLog(LogType), but none of the console logs were there.
2) Don't use console.log inside networkScript; I used require('system').stdout.write(...) instead. It was also logged to the console, but I can't get at the standard output stream of PhantomJS.exe from my C# code.
...
I really don't know how to solve this problem.
One way would be to log into a .txt file instead of the console, but it is a lot of text, and later there will be many open drivers, so I want to avoid ending up with lots of very large .txt files.
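A possible alternative I haven't verified: since ExecutePhantomJS returns the script's result back to C#, the events could be buffered in a property on the page instead of printed, and fetched afterwards. This is only a sketch; the __netLog name is a placeholder:

// Sketch: accumulate network events inside the PhantomJS page,
// then pull the buffer back into C# as JSON. Untested.
string captureScript =
    "var page = this;" +
    "page.__netLog = [];" +
    "page.onResourceRequested = function(req) { page.__netLog.push(req); };" +
    "page.onResourceReceived  = function(res) { page.__netLog.push(res); };";
driver.ExecutePhantomJS(captureScript);

driver.Navigate().GoToUrl(url);

// The return value of ExecutePhantomJS carries the serialized buffer.
var json = (string)driver.ExecutePhantomJS("return JSON.stringify(this.__netLog);");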
I'm trying to do some simple DOM manipulation when a page is rendered as a PDF using ABCPdf. I followed what they document here: http://www.websupergoo.com/helppdf9net/source/5-abcpdf/xhtmloptions/2-properties/usescript.htm
But when I try something as simple as the following:
var doc = new Doc();
doc.HtmlOptions.UseScript = true;
doc.HtmlOptions.UseNoCache = true;
doc.HtmlOptions.PageCachePurge();
doc.HtmlOptions.OnLoadScript = @"var reportElms = document.getElementsByClassName(""report"");";
doc.Page = doc.AddPage();
doc.AddImageUrl(Url.Action("TestPdf", "Pdf", new { }, "http"));
I get the exception:
Unable to render HTML. Unable to apply JScript.
COM error 80020101.
Script 'var reportElms = document.getElementsByClassName("report");'.
Any thoughts as to what I'm doing wrong?
Not even the built-in functions work
I'm even getting the same exception with the following script:
doc.HtmlOptions.OnLoadScript = @"
    window.ABCpdf_RenderWait();
    window.ABCpdf_RenderComplete();";
Btw, I'm using version 8 because that's what we have a licence for.
Edit:
I was missing the .external for the ABCpdf_RenderWait() and ABCpdf_RenderComplete() calls. It works if you reference them properly (imagine that):
doc.HtmlOptions.OnLoadScript = @"
    window.external.ABCpdf_RenderWait();
    window.external.ABCpdf_RenderComplete();";
Though as I mention in my answer, there are a lot of security hoops that need to be jumped through for IE also.
So I didn't actually get the IE engine to execute JavaScript the way I wanted but I was able to find a solution using the Gecko engine. The original NuGet install did not include the Gecko DLL, so I just downloaded the standalone install and added the DLLs manually.
After that everything worked exactly as expected.
I believe the IE engine didn't work because it requires a lot of security configuration; the FAQs spend a lot of time discussing security-related debugging: http://www.websupergoo.com/support.htm#6.7
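For what it's worth, switching engines looks roughly like this, assuming the ABCpdf 9+ API (check the docs for your version; the URL is a placeholder):

// Sketch: render with the Gecko engine instead of MSHTML (IE).
// Assumes ABCpdf 9+, where HtmlOptions exposes an Engine property.
var doc = new Doc();
doc.HtmlOptions.Engine = EngineType.Gecko;
doc.HtmlOptions.UseScript = true;
doc.AddImageUrl("http://example.com/report");  // placeholder URL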
I've got a website that I'd like to pull data from and it's really stuck in the stone ages. There's no web service, no API and it's very much an ASP/Session/table-based-layout page. Pretty fugly.
I'd like to just screen-scrape it and use JS (CoffeeScript) to automate that. I wonder if this is possible. I could do this with C# and LINQPad, but then I'm stuck parsing the tables (and sub-tables and sub-sub-tables) with regex. Plus, if I do it with JS or CoffeeScript, I'll get much more comfortable with those languages and I'll be able to use jQuery to pull elements out of the DOM.
I see two possibilities here:
use C# and find a library that will do jQuery-like things in C# code
use CoffeeScript (JS) and use jQuery to find the elements that I'm looking for in the page
I'd also like to automate the page a bit (get the next set of results). This is strictly for personal use -- I'm not pulling the results of someone's search to use in my business. I just want to make a crappy search engine do what I want.
I wrote a class that allows you to supply a bunch of URLs and a code block to scrape pages inside a Chrome extension. You can find the GitHub repo here: https://github.com/jkarmel/Executor. It could use some more testing, and I need to work on the documentation, but it looks like it might be what you are looking for.
Here is how you would use it to get all the links from a few different pages:
/*
 * background.js by Jeremy Karmel.
 */

URLS = ['http://www.apple.com/',
        'http://www.google.com/',
        'http://www.facebook.com/',
        'http://www.stanford.edu'];

// Function will be provided to executor to collect information
var getLinks = function() {
    var links = [];
    var $links = $('a');
    $links.each(function(i, val) { links.push(val.href); });
    var request = {data: links, url: window.location.href};
    chrome.extension.sendRequest(request);
};

var main = function() {
    var specForUsersTopics = {
        urls : URLS,
        code : getLinks,
        callback : function(results) {
            for (var url in results) {
                console.log(url + ' has ' + results[url].length + ' links.');
                var links = results[url];
                for (var i = 0; i < links.length; i++)
                    console.log('    ' + links[i]);
            }
            console.log('all done!!!!');
        }
    };
    var exec = Executor(specForUsersTopics);
    exec.start();
};

main();
So basically the code to collect the links is supplied to the Executor instance, and then you do whatever you want with the results in the callback. It can deal with longish lists of URLs (~1000) and will work on more than one at a time (the default is 5). It doesn't handle errors in the code block very well right now, so be sure to test the code you are supplying.
I'm liking Curtain A) "use C# and find a library..."
"HTML Agility Pack" might be just what you're looking for:
http://htmlagilitypack.codeplex.com/
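For instance, pulling all the links out of a page with it looks roughly like this; the URL and XPath are placeholders, not tailored to your site:

// Sketch: extract all links from a page with Html Agility Pack.
// Requires the HtmlAgilityPack package; the URL is a placeholder.
using System;
using HtmlAgilityPack;

class Scraper
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/search?page=1");

        // SelectNodes takes XPath and returns null when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
            foreach (var a in anchors)
                Console.WriteLine(a.GetAttributeValue("href", ""));
    }
}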
You can do it easily with Node.js, jsdom, and jQuery. See this tutorial (in JavaScript).