How to run JavaScript code in a C# crawler on the server side - c#

I have developed a crawler in C#.
I am reading data from a list page that uses JavaScript to redirect to the next page.
The function is: <a onclick="redirectToNextPage(PageID)">More</a>
How can I run this function on the server side and get the URL of the next page, so that I can use that URL to save the page?
I want to run the JavaScript function in C# to get the URL of the next page.

You'll almost certainly need a headless browser to do that, not just running JavaScript code without the context it expects to run in. This question and its answer list some headless browsers that can be used from C# (not all of them have JavaScript support, though). That list may well be out of date now, but that's the term you need to search for.

Try https://javascriptdotnet.codeplex.com/.
It exposes the Google V8 JavaScript engine to the CLI and also allows CLI objects to be manipulated from JavaScript.

Related

HtmlAgilityPack table returns null when selecting nodes [duplicate]

I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I get the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the javascript or bind it to its internal representation of the document. If you wanted to run the script you would need a web browser. The perfect answer to your problem would be a complete "headless" web browser. That is something that incorporates an HTML parser, a javascript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part of it. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this, http://www.awesomium.com/. It works fairly well but has no support for x64 and is not thread safe. I'm using it to scan some web sites 24x7 and it's running fine for at least a couple of days in a row but then it usually crashes.

C# Get HTML generated by Javascript in a .Net Console Application

I have to write a console application that grabs and parses data from a website.
Unfortunately, the website uses some kind of JavaScript framework to compose the page.
So what I need to do is get the HTML once the page has been rendered by JavaScript.
This is just the first step. My second step is to navigate the website to collect data from different pages, but unfortunately the pages that I have to parse do not have URLs; they are loaded by JavaScript too.
Do you have any ideas?
Thanks for your support,
Dario

Scraping data, loading scripts

Lately I've been trying to scrape some data from a web page using C#.
My problem is that when I use the WebBrowser object in C# to navigate to the web page, all I get in the body is:
<body>
<script language="javascript" src="com.astron.kapar.WebClient/com.astron.kapar.WebClient.nocache.js"></script>
</body>
But if you go to the actual web page https://kapalk1.mavir.hu/kapar/lt-publication.jsp?locale=en_GB and look at the source, you see there are some tables in the body, probably because the browser loads the scripts.
My question is: what is the way in C# to work with that kind of web page, for example to choose some dates and get some data? Is there a good library?
Sorry for my bad English.
You need to use either headless IE, or headless WebKit.
These questions might also be relevant.
Headless browser for C# (.NET)?
c# headless browser with javascript support for crawler
If you are familiar with JavaScript, one good solution for scraping a JavaScript-driven site is casperjs.
I find casperjs really easy to work with for scraping JavaScript-heavy sites.
Write a casperjs script that scrapes the site with CSS selectors and sends your desired output as JSON to stdout using JSON.stringify.
Invoke casperjs from C# using ProcessStartInfo. Read from stdout and deserialize the JSON back into a POCO.
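The C# side of that approach could look roughly like the sketch below. It is only an illustration under several assumptions: casperjs is installed and resolvable as "casperjs" on the PATH (on Windows you may need the full path to the executable), the script prints a single JSON array to stdout, and the ScrapedRow class with its Title and Url properties is a hypothetical POCO invented for this example. It uses System.Text.Json, which ships with modern .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.Json;

// Hypothetical POCO; rename the properties to match what your script emits.
public class ScrapedRow
{
    public string Title { get; set; } = "";
    public string Url { get; set; } = "";
}

public static class CasperRunner
{
    // Turns the JSON array printed by the script into a list of POCOs.
    public static List<ScrapedRow> ParseOutput(string json)
    {
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        return JsonSerializer.Deserialize<List<ScrapedRow>>(json, options)
               ?? new List<ScrapedRow>();
    }

    // Runs "casperjs <scriptPath>" and parses whatever it writes to stdout.
    // Assumption: "casperjs" can be started directly by name.
    public static List<ScrapedRow> Run(string scriptPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "casperjs",
            Arguments = "\"" + scriptPath + "\"",
            RedirectStandardOutput = true,
            UseShellExecute = false, // required when redirecting streams
            CreateNoWindow = true
        };

        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException("Could not start casperjs.");

        // Read stdout before waiting so a full pipe cannot block the child process.
        string output = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        return ParseOutput(output);
    }
}
```

Calling CasperRunner.Run with the path to your script would then return one ScrapedRow per object in the array. Keeping ParseOutput separate from Run makes the JSON handling easy to check without launching casperjs.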

Referencing jquery variables in C# code (ASP.NET MVC)

How do I reference a jquery variable in a C# block in the view using ASP.NET MVC?
For example:
$(":input[#name='mydropdown']").change(function () {
var selection = $("#myselection").val();
pop($("#md"), <%= Model.choices[selection] %>);
});
Where the selection that is in my C# block is the same as the selection that is referred to in my jquery.
This is not possible to do. The C# code is executed before the HTML is sent to the user's browser, which is before jQuery gets loaded, which is before the variable selection has a chance to exist.
There are two approaches to work around this:
Dump all the data that you care about from Model.choices into a JavaScript variable; your JS code can then access that variable. This is simple and good if your data is not too large in volume.
Have the JS code make an AJAX request to the server to get whatever data it needs by passing the value of selection as a query string parameter.
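As a rough sketch of the first approach, the server can serialize the data to JSON and emit it as a script block. The question does not say what type Model.choices is, so this example assumes a dictionary of strings; the ChoicesScript helper and the JavaScript variable name choices are hypothetical, and System.Text.Json is assumed to be available (on older ASP.NET MVC projects another JSON serializer would play the same role).

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class ChoicesScript
{
    // Builds a <script> element that assigns the data to a JS global named "choices".
    // Assumption: Model.choices can be represented as a string-to-string dictionary.
    public static string Render(Dictionary<string, string> choices)
    {
        // The default encoder escapes characters such as < and >, so the JSON
        // cannot close the surrounding script element early.
        string json = JsonSerializer.Serialize(choices);
        return "<script>var choices = " + json + ";</script>";
    }
}
```

The view would write the returned string into the page unencoded, after which the jQuery handler can read choices[selection] entirely on the client, with no server round trip.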
Perhaps try Sharpkit plugin from jquery website:
http://plugins.jquery.com/project/SharpKit
You can't do this because the browser (client) does not share any memory or state with the server whatsoever,
i.e.
the server executes the c# that renders the html and js
the browser downloads this and interprets it
the browser executes the javascript (no c#!)
I'd go with Jon's suggestion 1) as it will be more performant by negating the need for another callback to the server.
Long live ASP.NET MVC! :)

C# AJAX or Java response HTML scraping

Is there a way in C# to get the output of AJAX or Java? What I'm trying to do is grab the specifics of items on a webpage, however the webpage does not load it into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by javascript through ajax calls, and this modified data is what you are trying to capture then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script, otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
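For the easy case (a GET request with query string arguments), a minimal sketch might look like this. The AjaxReplay class and any endpoint or parameter names you pass in are hypothetical; the real ones have to be copied from the request you observe in the browser's network tools. The X-Requested-With header is included only on the assumption that the site checks for it, as some do.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public static class AjaxReplay
{
    private static readonly HttpClient Client = new HttpClient();

    // Builds "baseUrl?key=value&..." with URL-escaped keys and values.
    public static string BuildUrl(string baseUrl, IEnumerable<KeyValuePair<string, string>> query)
    {
        string queryString = string.Join("&", query.Select(p =>
            Uri.EscapeDataString(p.Key) + "=" + Uri.EscapeDataString(p.Value)));
        return queryString.Length == 0 ? baseUrl : baseUrl + "?" + queryString;
    }

    // Sends the GET request the page's script would send and returns the raw
    // response body, typically JSON or XML rather than HTML.
    public static async Task<string> FetchAsync(string baseUrl, IEnumerable<KeyValuePair<string, string>> query)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, BuildUrl(baseUrl, query));
        // Assumption: mirror whichever headers the browser actually sent.
        request.Headers.Add("X-Requested-With", "XMLHttpRequest");

        using var response = await Client.SendAsync(request);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```

The hard case mentioned above, a POST that carries page state, follows the same pattern but needs the form fields and cookies from the original page to be reproduced as well.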
A long time ago I wrote a VB app to screen scrape financial sites and made it so that you could fire up several of these "harvester" screen scrapers at once. Running them in parallel can cut the time spent loading data. We could do thousands of scrapes a day with several of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, like which customer to get next and what needed to be scraped (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to handle the DocumentCompleted event. That should only fire when the web page is completely loaded. Then check out this post which gives an overview of how to do it.
Use the Html Agility Pack. It allows download of .html and scraping via XPath.
See How to use HTML Agility pack
