Scraping data, loading scripts - C#

Lately I have been trying to scrape some data from a web page using C#.
My problem is that when I use a WebBrowser object in C# to manipulate the web page and navigate to it, all I get in the body is:
<body>
<script language="javascript" src="com.astron.kapar.WebClient/com.astron.kapar.WebClient.nocache.js"></script>
</body>
But if you go to the actual web page https://kapalk1.mavir.hu/kapar/lt-publication.jsp?locale=en_GB and look at the source, you can see there are tables in the body, presumably because the browser runs the scripts.
My question is: what is the way in C# to manipulate or deal with that kind of web page? For example, to choose some dates and get some data? Is there a good library?
Sorry for my bad English.

You need to use either headless IE or headless WebKit.
These questions might also be relevant.
Headless browser for C# (.NET)?
c# headless browser with javascript support for crawler

If you are familiar with JavaScript, one good solution for scraping a JavaScript-driven site is CasperJS.
I find CasperJS really easy to work with for scraping JavaScript-heavy sites.
Write a CasperJS script that scrapes the site with CSS selectors and sends your desired output as JSON to stdout using JSON.stringify.
Then invoke CasperJS from C# using ProcessStartInfo, read from stdout, and deserialize the JSON back into a POCO, as in the sketch below.
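A minimal sketch of that pipeline, assuming a CasperJS script named scrape.js that prints a JSON array to stdout (the Listing fields here are made up; shape them to whatever your script emits):

using System;
using System.Diagnostics;
using Newtonsoft.Json;

class Listing
{
    // Hypothetical fields - match these to the JSON your CasperJS script prints.
    public string Date { get; set; }
    public string Value { get; set; }
}

class Program
{
    static void Main()
    {
        // Assumes casperjs is on the PATH (on Windows you may need "casperjs.bat").
        var psi = new ProcessStartInfo
        {
            FileName = "casperjs",
            Arguments = "scrape.js",
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(psi))
        {
            string json = process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            var listings = JsonConvert.DeserializeObject<Listing[]>(json);
            foreach (var item in listings)
                Console.WriteLine("{0}: {1}", item.Date, item.Value);
        }
    }
}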

Related

HtmlAgilityPack table returns null when selecting nodes [duplicate]

I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some JavaScript to fetch the data it needs to populate the page. I'm interested in that data.
If I fetch the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts; the Html Agility Pack is an HTML parser only and has no way to interpret the JavaScript or bind it to its internal representation of the document. If you wanted to run the script, you would need a web browser.

The perfect answer to your problem would be a complete "headless" web browser: something that incorporates an HTML parser, a JavaScript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser without the rendering part. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
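A minimal sketch of that approach (the URL is a placeholder; note the control needs an STA thread and a message loop):

using System;
using System.Windows.Forms;

class Scraper
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };

        browser.DocumentCompleted += (sender, e) =>
        {
            // By now the page's scripts have had a chance to run, so the DOM
            // reflects what you would see in a real browser.
            Console.WriteLine(browser.Document.Body.InnerHtml);
            Application.ExitThread(); // stop the message loop
        };

        browser.Navigate("https://example.com/page-with-scripts");
        Application.Run(); // pump messages until ExitThread is called
    }
}

Note that DocumentCompleted can fire once per frame on framed pages, and AJAX content may still arrive after it fires, so you may need to wait or poll for the element you care about.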
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this, http://www.awesomium.com/. It works fairly well but has no support for x64 and is not thread safe. I'm using it to scan some web sites 24x7 and it's running fine for at least a couple of days in a row but then it usually crashes.

Loading Javascript with C# Console Application

I am currently using HtmlAgilityPack for some web scraping; however, I've encountered a website that has script tags, and I am unable to load it for scraping. I have little experience with the web and am unsure how to properly load the webpage and convert it back into something HtmlAgilityPack can parse.
Essentially, when I inspect the element in Chrome there is a table, but HtmlAgilityPack sees only a script tag.
Any help would be appreciated.
Thank you
I have had similar problems too. It is very annoying that there is not one unified method that works on all websites from a C# console application.
However, depending on the site you are looking at, there may be some information in meta tags in the head section of the HTML. When I was making an application to get a YouTube subscription count, I found it had the count in a meta tag (I assume this information is there for the scripts to use). This may be similar for the web page you are scraping.
To do this I first added a document.Save() call, passing the path where the HTML file should go. Then I opened the saved HTML document in Google Chrome, opened up dev tools, and did a search for "Subscriptions" (replace this with whatever you are looking for). Depending on the website you are scraping, there may be a tag with the information you need.
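As a sketch of that meta-tag approach with HtmlAgilityPack (the URL and the itemprop value are placeholders; inspect the real page to find the right tag):

using System;
using HtmlAgilityPack;

class MetaReader
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument document = web.Load("https://www.youtube.com/user/SomeChannel");

        // Save a local copy so you can inspect it in your browser's dev tools.
        document.Save("page.html");

        // Placeholder meta tag - search the saved HTML for the value you need
        // and copy its attributes into this XPath.
        var meta = document.DocumentNode.SelectSingleNode(
            "//meta[@itemprop='interactionCount']");
        if (meta != null)
            Console.WriteLine(meta.GetAttributeValue("content", ""));
    }
}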
Good Luck! :)

Parsing a web page with HtmlAgilityPack and simulating a click

I am scraping a certain web page using HAP, and I want to access the submit button on the page. The problem is that I don't know how this could be done in HAP and C#. Is there a way I could do this?
The HTML Agility Pack is not a browser, so while it can parse an HTML file, there is no way to really interact with it. You can find the submit object, read its properties and so forth, but you can't make it do anything.
You have two options:
Either read the form, build an HTTP request that matches the form's fields and post method, and send it to the server (see the sketch after this list). This is all manual work; the Agility Pack only helps you list the fields on the form and their properties.
If you need to interact with the page, you'll need a browser. There are headless browsers, like PhantomJS, that will actually load the page, parse the JavaScript, and run what's sent by the server. There are wrappers around those browsers for C#; one such example is Awesomium. It's similar to the HTML Agility Pack in that it allows you to parse HTML documents, but it takes things one step further, actually running the page without ever showing a browser window.
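A minimal sketch of the first option, using HttpClient (the action URL and field names are placeholders; copy the real ones from the <form> element you parsed with the Agility Pack):

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

class FormPoster
{
    static async Task Main()
    {
        // Placeholder fields - list every input the form would submit,
        // including hidden ones, with the values you want to "click" through.
        var fields = new Dictionary<string, string>
        {
            ["username"] = "me",
            ["submit"] = "Submit"
        };

        using (var client = new HttpClient())
        {
            var response = await client.PostAsync(
                "https://example.com/form-action",   // the form's action URL
                new FormUrlEncodedContent(fields));

            // The server's response to the simulated button click.
            string html = await response.Content.ReadAsStringAsync();
            Console.WriteLine(html);
        }
    }
}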

How to run JavaScript code in a C# crawler on the server side

I have developed a crawler in C#.
I am reading data from one page, a list page, which uses JavaScript to redirect to the next page.
The function is: <a onclick="redirectToNextPage(PageID)">More</a>
How can I run this function server-side and get the URL of the next page, so that I can save that page?
I want to run a JavaScript function in C# to get the URL of the next page.
You'll almost certainly need a headless browser to do that, rather than just running the JavaScript code without the context it expects to run in. This question and its answers list some headless browsers that can be used from C# (not all of them have JavaScript support, though). That list may well be out of date now, but that's the term you need to search for.
Try https://javascriptdotnet.codeplex.com/.
It exposes the Google V8 JS engine to the CLI and also allows CLI objects to be manipulated by JavaScript.
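As a rough sketch, usage looks something like this (class and method names are from the Javascript .NET documentation as I recall it, so treat the details as an assumption; note this gives you a JavaScript engine but no browser DOM, which is the caveat the previous answer raises):

using System;
using Noesis.Javascript; // the Javascript .NET package

class JsRunner
{
    static void Main()
    {
        using (var context = new JavascriptContext())
        {
            // Expose a CLI value to the script...
            context.SetParameter("pageId", 42);

            // ...run some JavaScript against it...
            context.Run("var nextUrl = 'list.jsp?page=' + (pageId + 1);");

            // ...and read the result back into C#.
            Console.WriteLine(context.GetParameter("nextUrl"));
        }
    }
}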

C# AJAX or Java response HTML scraping

Is there a way in C# to get the output of AJAX or JavaScript? What I'm trying to do is grab the specifics of items on a webpage; however, the webpage does not load them into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by JavaScript through AJAX calls, and this modified data is what you are trying to capture, then a standard .NET WebClient won't work. You need to use a WebBrowser control so that it actually executes the script; otherwise you will just be downloading the source.
If you just need the data, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools let you see what requests the browser makes.
There is no reason you cannot make the same web request from C# that the original page is making from JavaScript. Depending on the architecture of the website, this can range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content will most likely be XML or JSON instead of an HTML DOM, which, if you're scraping for data, is a plus.
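As a sketch, once you have captured the request in Firebug or the browser's dev tools, replaying it from C# might look like this (the endpoint URL and JSON shape below are invented for illustration):

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class InventoryClient
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            // Hypothetical endpoint - substitute the URL you saw in dev tools.
            string json = await client.GetStringAsync(
                "https://www.madisonhonda.com/api/inventory?type=preowned");

            // JSON is usually far easier to pick apart than rendered HTML.
            JArray cars = JArray.Parse(json);
            foreach (JToken car in cars)
                Console.WriteLine(car["model"]); // hypothetical field
        }
    }
}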
A long time ago I wrote a VB app to screen-scrape financial sites, and made it so that you could fire up several of these "harvester" screen scrapers at once, which helped with the long load times. We could do thousands of scrapes a day with several of these running on multiple boxes. Each harvester got its marching orders from information stored in the database: which customer to get next and what needed to be scraped (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentCompleted event, which should only fire when the web page is completely loaded. Then check out this post, which gives an overview of how to do it.
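Since AJAX content can keep arriving after DocumentCompleted fires, one common trick is to start polling the DOM for the element the scripts populate; a sketch (the element id is a placeholder):

using System;
using System.Windows.Forms;

class AjaxScraper
{
    [STAThread]
    static void Main()
    {
        var browser = new WebBrowser { ScriptErrorsSuppressed = true };
        var poll = new Timer { Interval = 500 }; // check twice a second

        poll.Tick += (s, e) =>
        {
            // Keep checking until the script-generated element shows up.
            var el = browser.Document == null
                ? null
                : browser.Document.GetElementById("inventory-list"); // placeholder id
            if (el != null)
            {
                Console.WriteLine(el.InnerHtml);
                poll.Stop();
                Application.ExitThread();
            }
        };

        browser.DocumentCompleted += (s, e) => poll.Start();
        browser.Navigate("http://www.madisonhonda.com/Preowned-Inventory.aspx");
        Application.Run();
    }
}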
Use the Html Agility Pack. It allows downloading the HTML and scraping it via XPath.
See How to use HTML Agility pack
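A minimal example of that combination (the URL and XPath are illustrative; adjust them to the page you are scraping):

using System;
using HtmlAgilityPack;

class XPathScrape
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example.com/listings");

        // Select every cell of every table row in the downloaded HTML.
        var cells = doc.DocumentNode.SelectNodes("//table//tr/td");
        if (cells != null)
            foreach (var cell in cells)
                Console.WriteLine(cell.InnerText.Trim());
    }
}

Keep in mind this only sees the HTML the server sends; if the table is built by JavaScript (as in the questions above), you still need one of the browser-based approaches first.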
