how can I use js/coffee to screen scrape an asp page? - c#

I've got a website that I'd like to pull data from and it's really stuck in the stone ages. There's no web service, no API and it's very much an ASP/Session/table-based-layout page. Pretty fugly.
I'd like to just screen scrape it and use js (coffeescript) to automate that. I wonder if this is possible. I could do this with C# and linqpad but then I'm stuck parsing the tables (and sub-tables and sub-sub-tables) with regex. Plus if I do it with js or coffeescript I'll get much more comfortable with those languages and I'll be able to use jQuery for pulling elements out of the DOM.
I see two possibilities here:
use C# and find a library that will do things like Jquery but in C# code
use coffeescript (js) and use jquery to find the elements that I'm looking for in the page
I'd also like to automate the page a bit (get next set of results). This is strictly for personal use -- I'm not pulling results of someone's search to use in my business. I just want to make a crappy search engine do what I want.

I wrote a class that allows you to supply a bunch of urls and a code block to scrape pages inside a chrome extension. You can find the github repo here: https://github.com/jkarmel/Executor. It could use some more testing and I need to work on the documentation, but it looks like it might be what you are looking for.
Here is how you would use it to get all the links from a few different pages:
/*
* background.js by Jeremy Karmel.
*/
URLS = ['http://www.apple.com/',
'http://www.google.com/',
'http://www.facebook.com/',
'http://www.stanford.edu'];
//Function will be provided to executor to collect information
var getLinks = function() {
var links = [];
var $links = $('a');
$links.each(function(i, val) {links.push(val.href)});
var request = {data: links, url: window.location.href};
chrome.extension.sendRequest(request);
}
var main = function() {
var specForUsersTopics = {
urls : URLS,
code : getLinks,
callback : function(results) {
for (var url in results) {
console.log(url + ' has ' + results[url].length + ' links.');
var links = results[url];
for (var i = 0; i < links.length; i++)
console.log(' ' + links[i]);
}
console.log('all done!!!!');
}
};
var exec = Executor(specForUsersTopics);
exec.start();
}
main();
So basically the code to collect the links would be supplied to the executor instance and then you would do whatever you wanted with the results in the callback. It can deal with longish lists of URLs (~1000) and it will work on more than one at a time (default == 5). It doesn't handle errors in the code block very well right now, so be sure to test the code you are supplying.

I'm liking Curtain A) "use C# and find a library..."
"HTML Agility Pack" might be just what you're looking for:
http://htmlagilitypack.codeplex.com/

You can do it easily with Node.js, jsdom, and jQuery. See this tutorial (in JavaScript).

Related

JINT - Unable to "console.log"

I am new to JINT, and trying to just do some basic tests to kind of learn the ropes. My first attempt was to just store some javascript in my database, load it, and execute it in a unit test. So that looks essentially like this....
[Fact]
public void can_use_jint_engine() {
using (var database = DocumentStore()) {
using (var session = database.OpenSession()) {
var source = session.Load<Statistic>("statistics/1");
// join the list of strings into a single script
var script = String.Join("\n", source.Scripting);
// this will create the script
// console.log("this is a test from jint.");
//
var engine = new Jint.Engine();
// attempt to execute the script
engine.Execute(script);
}
}
}
And it doesn't work, I get this error, which makes absolutely no sense to me, and I cannot find any documentation on.
Jint.Runtime.JavaScriptException: console is not defined at
Jint.Engine.Execute(Program program) at
Jint.Engine.Execute(String source) at
SampleProject.Installers.Instanced.__testing_installer.can_use_jint_engine()
in _testing_installer.cs: line 318
Can anyone assist in shedding some light on this? I'm pretty confused at this point.
With JavaScript there are three entities we care about: the host (a browser, your application, etc.), the engine (JINT in this case), and the script ("console.log(...)" in this case).
JavaScript defines a bunch of functions and objects as part of the language, but console is not one of them. By convention, browsers define a console object that can be used in the manner you describe. However, since your app is not a browser (and JINT does not do this by itself), there's no console object defined in your namespace (globals).
What you need to do is add a console object that will be accessible in JINT. You can find how to do this in the docs, but here's a simple example of how to add a log function to the engine so it can be used from the JS code (example taken from github).
var engine = new Engine()
.SetValue("log", new Action<object>(Console.WriteLine))
;
engine.Execute(@"
function hello() {
log('Hello World');
};
hello();
");

How to get URI with Phantomjs

I need to get a list of all web-pages in web-site (all links). I have to use Phantomjs, but I never have used it before. Can anybody explain me, how I can use it? How to parse the html code with help of Phantomjs to get all links?
PhantomJS is a headless WebKit scriptable with a JavaScript API. It's redistributed as a single executable.
Download PhantomJS from the official web site
There are official releases for Windows, Mac, or Linux, but you can also build your own version if you want.
Create a script
PhantomJS does nothing by itself; it's just an executable. You have to script your actions, in JavaScript or CoffeeScript.
Run the script
From the command prompt, you just have to type
> phantomjs yourscript.js
Sometimes you have to create a wrapper for PhantomJS. Especially in WPF, use the Process/ProcessStartInfo classes to manage the script execution.
How to write a script?
If you're familiar with JavaScript, and especially Node.js development, the learning curve is small. The quick start guide can be valuable, and do not hesitate to practice with the available examples. That's the most difficult part, but after a few scripts it will be easier.
To answer your initial question, here is a possible script
var page = require('webpage').create();
var system = require('system');
if (system.args.length != 2) {
console.log('Usage: so20189669.js <URL> ');
phantom.exit(1);
} else {
var url = system.args[1];
page.open(url, function (status) {
if (status !== 'success') {
console.log('Unable to load the address!');
phantom.exit();
} else {
var links = page.evaluate(function () {
return [].map.call(document.querySelectorAll('a'), function (link) { return link.getAttribute('href') });
});
console.log(JSON.stringify(links));
phantom.exit();
}
});
}
In the Command Prompt :
>phantomjs.exe so20189669.js http://stackoverflow.com/questions/20189669/how-to-get-uri-with-phantomjs
There is no magic answer and you will have to alter it depending on your needs !
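One alteration you will probably need: getAttribute('href') returns the raw attribute, so the list can contain relative paths, nulls, duplicates, and mailto: links. Here is a rough post-processing sketch. It is meant to run in Node.js on the JSON the script prints, not inside PhantomJS itself, and the function name normalizeLinks is my own invention.

```javascript
// Post-processing for the JSON array printed by the PhantomJS script.
// Assumes a runtime with the WHATWG URL class (e.g. Node.js).
function normalizeLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue;              // <a> without an href attribute yields null
    let url;
    try {
      url = new URL(href, baseUrl);   // resolve relative links against the page URL
    } catch (e) {
      continue;                       // skip values that cannot be parsed
    }
    // Keep only web pages; drops mailto:, javascript:, and similar schemes.
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
    url.hash = '';                    // same page, different fragment
    seen.add(url.href);
  }
  return Array.from(seen);            // insertion order, duplicates removed
}

const sample = ['/a', 'b#x', null, 'mailto:x@y.z', 'http://other.com/', '/a'];
console.log(normalizeLinks(sample, 'http://example.com/dir/page'));
```

To crawl a whole site you would feed the same-host results back into the script, keeping a set of already visited URLs.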

Silverlight passing an array to a web page's dialog arguments

I have the following line of code to open a web page modal dialog in C# (Silverlight):
var so = (ScriptObject)HtmlPage.Window.Invoke(
"showModalDialog",
modalWindowUrl,
dialogArgs,
"dialogWidth:600px;dialogHeight:600px;");
Now, code similar to the following is being called on the page I am displaying, and I need to make sure it gets the values I'm trying to pass in (this is a MSCRM web page I don't have control over):
dialogArgs.items <-- will be an array I pass in
dialogArgs.items[i].getAttribute("oid") <-- will return something
dialogArgs.items[i].getAttribute("otype") <-- will return something
dialogArgs.items[i].values <-- will return something
What I have tried to send in (from my C# code) is this:
dialogArgs = @"{items:[{oid:" + id + ",otype:" + type + "}]}";
which will result in a JSON string... but I'm guessing this just ends up as a string within the JavaScript and not a JSON object.
Any ideas how I get this to work?
A few side notes:
I can't get IE to debug the modal dialog that results from this call. I can get the debugging tools displaying, but it won't attach to the page because it cannot refresh it.
I don't have control over this modal dialog. It's a page that is displayed using MS Dynamics CRM. For that reason I cannot mess with the JavaScript or anything to test stuff.
Looks like I won the tumbleweed award for this one! Can't believe how uncommon this scenario seems to be. The solution ended up being quite simple, but not very documented so took me a while to track down. Thought I would share here.
Firstly, a quick search across the internet reveals that we can set this up using the following:
var dialogArgs = HtmlPage.Window.CreateInstance("Object");
Which gives you a ScriptObject back. For properties:
dialogArgs.SetProperty("items", items);
Some code for setting up an array and an item should look something like this (I have just created a new GUID for the purpose of this example):
var item = HtmlPage.Window.CreateInstance("Object");
item.SetProperty("oid", Guid.NewGuid());
item.SetProperty("otype", "account");
var items = HtmlPage.Window.CreateInstance("Object");
items.SetProperty(0, item);
And finally, just pass that object straight into your dialog window like this:
var so = (ScriptObject)HtmlPage.Window.Invoke("showModalDialog", lookUpWindow, dialogArgs, "dialogWidth:600px;dialogHeight:600px;");
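For anyone wondering why the string attempt in the question could not work, here is a small JavaScript illustration. It is simplified: readFirstOid is a made-up stand-in for the CRM page's script, and it reads a plain property rather than calling getAttribute as the real page does.

```javascript
// Simplified stand-in for what the dialog page does with its argument.
function readFirstOid(dialogArgs) {
  if (!dialogArgs || !dialogArgs.items) return undefined;
  return dialogArgs.items[0].oid;
}

// What the question passed: text that merely looks like an object.
const asString = '{"items":[{"oid":"42","otype":"account"}]}';

// What the page needs: a real object with an indexable `items` member.
const asObject = { items: [{ oid: '42', otype: 'account' }] };

console.log(readFirstOid(asString)); // a string has no `items` property
console.log(readFirstOid(asObject));
```

The page never parses its argument, so whatever arrives has to be an object already, which is what CreateInstance and SetProperty build on the Silverlight side.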

Webbrowser control is not showing Html but shows webpage

I am automating a task using the WebBrowser control; the site displays its pages using frames.
My issue is that I reach a point where I can see the web page loaded properly in the WebBrowser control, but when I inspect the HTML in code I see nothing.
I have seen other examples here too, but none of them return all of the browser's HTML.
What I get by using this:
HtmlWindow frame = webBrowser1.Document.Window.Frames[1];
string str = frame.Document.Body.OuterHtml;
is just:
The main frame tag with attributes like SRC etc. Is there any way to handle this? Since I can see the web page completely loaded, why do I not see the HTML? When I do this in Internet Explorer I do see the page's source once it has loaded, so why not here?
ADDITIONAL INFO
There are two frames on the page.
I use the same approach as above:
HtmlWindow frame = webBrowser1.Document.Window.Frames[0];
string str = frame.Document.Body.OuterHtml;
And I get the correct HTML for the first frame, but for the second one I only see:
<FRAMESET frameSpacing=1 border=1 borderColor=#ffffff frameBorder=0 rows=29,*><FRAME title="Edit Search" marginHeight=0 src="http://web2.westlaw.com/result/dctopnavigation.aspx?rs=WLW12.01&ss=CXT&cnt=DOC&fcl=True&cfid=1&method=TNC&service=Search&fn=_top&sskey=CLID_SSSA49266105122&db=AK-CS&fmqv=s&srch=TRUE&origin=Search&vr=2.0&cxt=RL&rlt=CLID_QRYRLT803076105122&query=%22LAND+USE%22&mt=Westlaw&rlti=1&n=1&rp=%2fsearch%2fdefault.wl&rltdb=CLID_DB72585895122&eq=search&scxt=WL&sv=Split" frameBorder=0 name=TopNav marginWidth=0 scrolling=no><FRAME title="Main Document" marginHeight=0 src="http://web2.westlaw.com/result/dccontent.aspx?rs=WLW12.01&ss=CXT&cnt=DOC&fcl=True&cfid=1&method=TNC&service=Search&fn=_top&sskey=CLID_SSSA49266105122&db=AK-CS&fmqv=s&srch=TRUE&origin=Search&vr=2.0&cxt=RL&rlt=CLID_QRYRLT803076105122&query=%22LAND+USE%22&mt=Westlaw&rlti=1&n=1&rp=%2fsearch%2fdefault.wl&rltdb=CLID_DB72585895122&eq=search&scxt=WL&sv=Split" frameBorder=0 borderColor=#ffffff name=content marginWidth=0><NOFRAMES></NOFRAMES></FRAMESET>
UPDATE
The two URLs of the frames are as follows:
Frame1, whose HTML I see:
http://web2.westlaw.com/nav/NavBar.aspx?RS=WLW12.01&VR=2.0&SV=Split&FN=_top&MT=Westlaw&MST=
Frame2, whose HTML I do not see:
http://web2.westlaw.com/result/result.aspx?RP=/Search/default.wl&action=Search&CFID=1&DB=AK%2DCS&EQ=search&fmqv=s&Method=TNC&origin=Search&Query=%22LAND+USE%22&RLT=CLID%5FQRYRLT302424536122&RLTDB=CLID%5FDB6558157526122&Service=Search&SRCH=TRUE&SSKey=CLID%5FSSSA648523536122&RS=WLW12.01&VR=2.0&SV=Split&FN=_top&MT=Westlaw&MST=
And the properties of the second frame, whose HTML I do not get, are in the picture below:
Thank you
I paid for a solution to the question above and it works 100%.
What I did was use the function below, which walks every frame recursively and returned the count from the tag I was seeking but could not find. Use this to call the function listed below:
FillFrame(webBrowser1.Document.Window.Frames);
private void FillFrame(HtmlWindowCollection hwc)
{
if (hwc == null) return;
foreach (HtmlWindow hw in hwc)
{
HtmlElement getSpanid = hw.Document.GetElementById("mDisplayCiteList_ctl00_mResultCountLabel");
if (getSpanid != null)
{
doccount = getSpanid.InnerText.Replace("Documents", "").Replace("Document", "").Trim();
break;
}
if (hw.Frames.Count > 0) FillFrame(hw.Frames);
}
}
Hope it helps people .
Thank you
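For readers who want the idea behind FillFrame without the WinForms types, here is a rough JavaScript sketch of the same depth-first walk. The frame model ({ elements, frames }) is hypothetical and only stands in for HtmlWindow. Unlike the C# above, it returns the first match and stops searching.

```javascript
// Hypothetical frame model: { elements: { id: text }, frames: [child, ...] }
function findInFrames(frames, id) {
  for (const frame of frames || []) {
    const elements = frame.elements || {};
    if (Object.prototype.hasOwnProperty.call(elements, id)) {
      return elements[id];                      // found in this frame
    }
    const nested = findInFrames(frame.frames, id);
    if (nested !== undefined) return nested;    // found somewhere below it
  }
  return undefined;                             // not present in this subtree
}

// A page with a nav frame and a frameset whose inner frame holds the label.
const sampleFrames = [
  { elements: { nav: 'menu' }, frames: [] },
  { elements: {}, frames: [
    { elements: { resultCount: '12 Documents' }, frames: [] }
  ] }
];

console.log(findInFrames(sampleFrames, 'resultCount'));
```

This matches the symptom in the question: the second top-level frame was itself a FRAMESET, so the element lived one level further down.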
To get the HTML, you can do it this way:
WebClient client = new WebClient();
string html = client.DownloadString(@"http://stackoverflow.com");
That's an example of course, you can change the address.
By the way, you need using System.Net;
This works just fine...gets BODY element with all inner elements:
Somewhere in your Form code:
wb.Url = new Uri("http://stackoverflow.com");
wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wbDocumentCompleted);
And here is wbDocumentCompleted:
void wbDocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
var yourBodyHtml = wb.Document.Body.OuterHtml;
}
wb is System.Windows.Forms.WebBrowser
UPDATE:
The same as for the document, I think that your second frame is not loaded at the time you check for its content... You can try solutions from this link. You will have to wait for your frames to be loaded in order to see their content.
The most likely reason is that frame index 0 has the same domain name as the main/parent page, while the frame index 1 has a different domain name. Am I correct?
This creates a cross-frame security issue, and the WB control just leaves you high and dry and doesn't tell you what on earth went wrong, and just leaves your objects, properties and data empty (will say "No Variables" in the watch window when you try to expand the object).
The only thing you can access in this situation is pretty much the URL and iFrame properties, but nothing inside the iFrame.
Of course, there are ways to overcome the cross-frame security issues, but they are not built into the WebBrowser control, and they are external solutions, depending on which WB control you are using (as in, .NET version or pre-.NET version).
Let me know if I have correctly identified your problem, and if so, if you would like me to tell you about the solution tailored to your setup & instance of the WB control.
UPDATE: I have noticed that you're doing a .getElementByTagName("HTML")(0).outerHTML to get the HTML; all you need to do is call this on the document object, or the .body object, and that should do it. MyDoc.Body.innerHTML should get the content you want. Also, notice that there are additional iFrames inside these documents, in case that is of relevance. Can you give us the main document URL that has these two URLs in it so we / I can replicate what you're doing here? Also, not sure why you are using DomElement, but you should just cast it to the native object it wants to be cast to, either an IHTMLDocument2 or the object you see in the watch window, which I think is IHTMLFrameElement (if I recall correctly, but you will know what I mean once you see it). If you are trying to use an XML object, this could be the reason why you aren't able to get the HTML content; change the object declaration and casting if there is one, give it a go & let us know :). Now I'm curious too :).
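A quick way to test the cross-domain theory is to compare the frame URLs directly. The sketch below uses the standard same-origin rule (scheme, host, and port must all match) via the URL class in Node.js. The exact rules the WebBrowser control applies may differ in detail, so treat it as a first check only. The function name sameOrigin is made up, and the URLs are the two from the question with the query strings trimmed.

```javascript
// Two URLs share an origin when scheme, host, and port all match.
function sameOrigin(a, b) {
  return new URL(a).origin === new URL(b).origin;
}

// The two frame URLs from the question, query strings trimmed.
const frame1 = 'http://web2.westlaw.com/nav/NavBar.aspx';
const frame2 = 'http://web2.westlaw.com/result/result.aspx';

console.log(sameOrigin(frame1, frame2));
console.log(sameOrigin(frame1, 'https://web2.westlaw.com/nav/NavBar.aspx'));
```

If the frames do share an origin, as these two appear to, a security block is less likely and load timing or nested framesets become the better suspects.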

Using PlotKit (javascript) through C#

I'm relatively new to Javascript, and although I know how to use it, I don't really understand the mechanics behind it. Bear with me here.
I need to write a small app that creates a chart (in SVG) based on data I take in as an XML file. I found PlotKit, which does exactly what I need, except that it's written in Javascript, while my current program is written in c#. I did some googling and found a few articles which explain how to evaluate simple Javascript code using the .NET VsaEngine class. Unfortunately, I have absolutely no idea how to use the VsaEngine to execute more complicated Javascript that requires references to other files. Basically, all I want is for c# to be able to call something like this as Javascript:
var layout = new PlotKit.Layout("bar", {});
layout.addDataset("data", [[0, 0], [1, 1], [2, 2]]);
layout.evaluate();
var canvas = MochiKit.DOM.getElement("graph");
var plotter = new PlotKit.SVGRenderer(canvas, layout, {});
var svg = SVGRenderer.SVG();
And get back the SVG string for the chart. I have no idea how to make it so that the above script knows where to look for all of the necessary objects. If I were to make a web page to do this, I would just add a few script headers referencing /plotkit/Layout.js, /plotkit/Canvas.js, etc., the Javascript would work fine.
If anyone could explain exactly how I would use PlotKit through C#, or could explain a more effective way to do this, I would really appreciate it.
EDIT: I realize I wasn't too clear with this question - I need my c# program to emulate a Javascript engine and use the PlotKit library without actually running a web browser. Is there any way to do this?
PlotKit is a JavaScript library that is intended to execute in the Client's Web Browser. C# is executed on the Server. To go about communicating between the two, you would render whatever data you wish to pass to PlotKit on the server and then output it in the HTML you send to the client.
So in your C# codebehind you would construct the JSON object that would be passed to PlotKit's addDataset method.
...
public partial class Default : System.Web.UI.Page
{
protected string PlotKitData = "[]";
protected void Page_Load(object sender, EventArgs e)
{
if (Page.IsPostBack) PlotKitData = GenerateJSON();
...
Then in your ASPX codefront you would have something like this.
<script>
var layout = new PlotKit.Layout("bar", {});
layout.addDataset("data", <%=PlotKitData%>);
layout.evaluate();
var canvas = MochiKit.DOM.getElement("graph");
var plotter = new PlotKit.SVGRenderer(canvas, layout, {});
var svg = SVGRenderer.SVG();
</script>
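Judging only by the addDataset call in the question, the data GenerateJSON emits would need to be an array of [x, y] pairs. Here is a small JavaScript sketch of that shape. The helper toDataset is hypothetical, and using the array index as the x value is an assumption for illustration.

```javascript
// Turn a flat list of y values into [x, y] pairs, using the index as x.
function toDataset(values) {
  return values.map(function (y, x) {
    return [x, y];
  });
}

// Serialized form, i.e. the text the server would write into the page.
const json = JSON.stringify(toDataset([0, 1, 2]));
console.log(json);
```

Whatever produces the text on the server, it should come out as valid JSON so that it can be dropped into the script block unchanged.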
Perhaps ZedGraph might suit your needs instead?
