HTML Page Scraping - C#

What's the best way to scrape a web page that has AJAX/dynamic loading of data?
For example: scraping a web page that presents 20 images on load, but loads more images as the user scrolls down the page (sort of like Facebook). In such a case, how do you scrape all the images, not just the first 20?

This is something that not even the major search engines have mastered yet. It's called "event-driven crawling".
Google even has a guide on what to do to help them crawl your AJAX sites better.
The best thing would be to read some open-source crawlers and see what they do. But your chances of crawling even 80% of such content are slim at best, unless you have a specific target in mind.
There are also some interesting reads at crawljax.
Basically, you should look for scripts and check whether they make any AJAX calls, then determine what kind of parameters they take and make repeat calls with incremented/decremented parameter values. This only works if the parameters have a logical pattern, such as being numbers, single letters, etc. It also depends on whether you're targeting a known site or just sending the crawler into the wild. If you know your target, you can inspect its DOM and customize your code for greater accuracy, as mentioned by wolf.
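For example, a rough sketch of replaying such a paging endpoint with incremented parameter values (the endpoint URL and the offset/count parameter names here are made up; you'd discover the real ones from the site's scripts or network traffic):

```csharp
// Minimal sketch: replay a paging AJAX endpoint with an incremented offset.
// The endpoint URL and the "offset"/"count" parameter names are hypothetical.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class AjaxPager
{
    static async Task Main()
    {
        using var client = new HttpClient();

        for (int offset = 0; offset < 200; offset += 20)
        {
            var url = $"https://example.com/api/images?offset={offset}&count=20";
            var json = await client.GetStringAsync(url);

            if (string.IsNullOrWhiteSpace(json) || json == "[]")
                break; // no more pages

            Console.WriteLine($"offset {offset}: {json.Length} bytes of JSON to parse");
        }
    }
}
```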
Good luck

Use a tool such as Fiddler or WireShark to inspect the web request that is done when loading more items.
Then replicate the request in your code.
Update (thanks to pguardiario for his comment):
Note that Wireshark is a low-level network capture tool that offers a great deal of detail about the traffic (packets being exchanged, DNS lookups, and so on), and may be painful to use in this scenario, where you only wish to see the HTTP requests.
So you're better off using Fiddler, or a similar tool in a browser (e.g. Chrome's Network panel in the developer tools).
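Once you've found the request, replaying it might look roughly like this (the URL and headers are placeholders for whatever Fiddler shows you):

```csharp
// Sketch of replicating a request captured in Fiddler. Copy the real URL,
// query string and headers from the captured request.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ReplayCapturedRequest
{
    static async Task Main()
    {
        using var client = new HttpClient();

        var request = new HttpRequestMessage(HttpMethod.Get,
            "https://example.com/feed/loadMoreItems?page=2");

        // AJAX endpoints often check for this header.
        request.Headers.Add("X-Requested-With", "XMLHttpRequest");
        request.Headers.Referrer = new Uri("https://example.com/feed");

        var response = await client.SendAsync(request);
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```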

Crawljax is open source and can dynamically crawl Ajax-based content.

Related

Display thumbnail of page when displayed in website search results

Our client wants to display a thumbnail of a screenshot of a page when it is listed in the search results of the website. Of course, they want it to be automated. The website is built on Sitecore 9.0 and uses SOLR for indexing. It seems that creating a computed index field would be the best option performance-wise, but I feel like it will still take forever when running a full index rebuild, as it's making an HTTP request for every page.
I took a look at some solutions for capturing thumbnails; this one looks to be the most promising: http://html2canvas.hertzen.com/. However, it doesn't seem like it will work with server-side C# HTTP requests. Also, I'm not sure how I'd selectively toggle the html2canvas event on pages, or have the page send back the image as a response to the HTTP request.
Any other solution ideas would be appreciated.
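For reference, the computed index field I had in mind would look roughly like this; the screenshot service URL is hypothetical, and the expensive capture is deferred to request time rather than index time:

```csharp
// Rough sketch of a computed index field that stores a thumbnail URL.
// The screenshot endpoint (thumbnailServiceUrl) is hypothetical -- it would
// point at whatever screenshot service ends up being hosted.
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Items;
using Sitecore.Links;

public class PageThumbnailField : IComputedIndexField
{
    public string FieldName { get; set; }
    public string ReturnType { get; set; }

    public object ComputeFieldValue(IIndexable indexable)
    {
        Item item = indexable as SitecoreIndexableItem;
        if (item == null || item.Visualization.Layout == null)
            return null; // only pages with a layout get a thumbnail

        string pageUrl = LinkManager.GetItemUrl(item);

        // Store a URL that resolves to the screenshot service instead of
        // calling it during the rebuild, so the rebuild stays fast.
        string thumbnailServiceUrl = "https://thumbs.example.com/capture?url=";
        return thumbnailServiceUrl + System.Uri.EscapeDataString(pageUrl);
    }
}
```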

Best way to collect information from web page

I need to get information from a couple of web sites. For example, this site.
What would be the best way to get all the links from a page so that the information can be extracted?
Sometimes I need to click on a link to get to other links inside it.
I tried WatiN and I also tried doing the same from within Excel 2007 with the Web Data option.
Could you please suggest a better way that I am not aware of?
NCrawler might be very useful for deep-level crawling. You can also set MaxCrawlDepth to control how deep it goes.
Have a look at WGet. It is an incredibly powerful tool for mining the content of a single page or an entire website. The options available allow you to dictate how many levels deep to follow in terms of links, what to do with static resources such as images, how to handle relative links, etc. It also does a very good job of mining pages which are generated dynamically, such as those served by CGI or ASP.
It's been around for many years in the 'nix world but executables compiled for Windows are readily available.
You would need to kick it off from .NET using Process.Start but you could then pipe the results into multiple files (which mimic the original website structure), a single file, or into memory by capturing standard output. Then you can do subsequent analysis such as extracting HREF HTML elements (if it is only links you are interested in) or grabbing the sort of table data evident in the link you provide in your question.
I realise this is not a 'pure' .NET solution but the power WGET offers more than compensates for this, in my opinion. I have used it myself in the past, in this way, for exactly the sort of thing I think you are trying to do.
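A rough sketch of kicking WGet off from .NET and capturing standard output (this assumes wget is on the PATH; the arguments are just an example):

```csharp
// Sketch of driving wget from .NET and capturing its output in memory.
// Assumes wget(.exe) is on the PATH; for a recursive mirror you would use
// options like "-r -l 2" and read the files it writes instead of stdout.
using System;
using System.Diagnostics;

class WgetFetch
{
    static void Main()
    {
        var psi = new ProcessStartInfo
        {
            FileName = "wget",
            Arguments = "-q -O - http://example.com/", // -O -: write the page to stdout
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using var process = Process.Start(psi);
        string html = process.StandardOutput.ReadToEnd();
        process.WaitForExit();

        Console.WriteLine($"Captured {html.Length} characters of HTML to analyse.");
    }
}
```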
I recommend using http://watin.org/. It is much simpler than wget :-)

How to scrape a flash based site?

We are using the Html Agility Pack to scrape data from HTML-based sites; is there any DLL like the Html Agility Pack for scraping Flash-based sites?
It really depends on the site you are trying to scrape. There are two types of sites in this regard:
If the site has the data inside the swf file, then you'll have to decompile the swf file and read the data inside. With enough work you can probably do it programmatically. However, if this is the case, it might be easier to just gather the data manually, since it probably isn't going to change much.
In most cases, however, especially with sites that have a lot of data, the Flash file is actually contacting an external API. In that case you can simply ignore the Flash altogether and go to the API directly. If you're not sure, just activate Firebug's Net panel and start browsing. If the site is using an external API, it should become obvious.
Once you find that API, you could probably reverse engineer how to manipulate it to give you whatever data you need.
Also note that if it's a big enough site, there are probably non-flash ways to get to the same data:
It might have a mobile site (with no flash) - try accessing the site with an iPhone user-agent.
It might have a site for crawlers (like googlebot) - try accessing the site with a googlebot user-agent.
EDIT:
If you're talking about crawling (getting data from any random site) rather than scraping (getting structured data from a specific site), then there's not much you can do; even Googlebot isn't scraping Flash content. That's mostly because, unlike HTML, Flash doesn't have a standardized syntax from which you can immediately tell what is text, what is a link, etc.
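A minimal sketch of the user-agent trick mentioned above (the Googlebot string is a published user-agent; the target URL is a placeholder):

```csharp
// Sketch of requesting a page with a spoofed user-agent to reach the
// non-Flash (mobile or crawler-facing) version of a site, if one exists.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class UserAgentSpoof
{
    static async Task Main()
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.UserAgent.ParseAdd(
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");
        // Or use an iPhone user-agent string to get the mobile site instead.

        string html = await client.GetStringAsync("https://example.com/");
        Console.WriteLine(html.Substring(0, Math.Min(500, html.Length)));
    }
}
```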
You won't have much luck with the Html Agility Pack. One method would be to use something like FiddlerCore to proxy HTTP requests to/from a Flash site. You would start the FiddlerCore proxy, then use something like the C# WebBrowser control to go to the URL you want to scrape. As the page loads, all those HTTP requests get proxied and you can inspect their contents. However, you wouldn't get most of the text, since that's often static within the Flash. Instead, you'd mostly get the larger content (videos, audio, and maybe images) that is usually stored separately. This will be slow compared to more traditional scraping/crawling because you'll actually have to execute/run the page in a browser.
If you're familiar with all of those YouTube Downloader type extensions, they work on this same principle, except that they intercept HTTP requests directly from Firefox (for example) rather than through a separate proxy.
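A rough sketch of the FiddlerCore part, under the assumption that you then drive a browser (e.g. the WebBrowser control) to the page while the proxy is running:

```csharp
// Sketch of using FiddlerCore to log the HTTP traffic generated while a
// Flash-heavy page runs in a browser. The port number is an arbitrary choice.
using System;
using Fiddler;

class FlashTrafficLogger
{
    static void Main()
    {
        FiddlerApplication.AfterSessionComplete += session =>
        {
            // Each proxied request/response pair shows up here.
            Console.WriteLine($"{session.responseCode} {session.fullUrl}");
        };

        // Start the local proxy; Default registers it as the system proxy.
        FiddlerApplication.Startup(8877, FiddlerCoreStartupFlags.Default);

        Console.WriteLine("Browse to the Flash site now; press Enter to stop.");
        Console.ReadLine();

        FiddlerApplication.Shutdown();
    }
}
```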
I believe that Google and some of the big search engines have a special arrangement with Adobe/Flash and are provided with some software that lets their search engine crawlers see more of the text and things that Google relies on. Same goes for PDF content. I don't know if any of this software is publicly available.
Scraping Flash content would be quite involved, and the reliability of any component that claims to do so is questionable at best. However, if you wish to "crawl" or follow hyperlinks in a Flash animation on some web page, you might have some luck with Infant. Infant is a free Java library for web crawling, and offers limited / best-effort Flash content hyperlink following abilities. Infant is not open source, but is free for personal and commercial use. No registration required!
How about capturing the whole page as an image and running OCR on it to read the data?
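If you go that route, here is a minimal sketch using the Tesseract .NET wrapper, assuming the screenshot has already been saved to page.png and English language data is installed:

```csharp
// Sketch of the OCR idea, assuming the NuGet "Tesseract" wrapper is installed,
// English traineddata sits in ./tessdata, and a screenshot exists as page.png.
using System;
using Tesseract;

class OcrScrape
{
    static void Main()
    {
        using var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default);
        using var image = Pix.LoadFromFile("page.png");
        using var page = engine.Process(image);

        Console.WriteLine(page.GetText());
        Console.WriteLine($"Mean confidence: {page.GetMeanConfidence():P0}");
    }
}
```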

How to capture visited URLs and their html by any browsers

I want to find a decent solution to track the URLs and HTML content that users are visiting, and provide more information to the user. The solution should have minimal impact on end users.
I don't want to write plugins for different browsers. It's hard to maintain.
I don't accept proxy method, since I don't want to change any of user's proxy settings.
My application is written in C# and targets Windows. It would be best if the solution could support other OSes as well.
Based on my research, I found the following methods that look workable for me, but all of them have drawbacks and I can't determine which one is best.
Use WinPcap
WinPcap sniffs all TCP packets without changing any user settings and only requires installing the WinPcap setup, which is acceptable to me. But I have two questions:
a. How do I convert TCP packets into URLs and HTML?
b. Does it really impact performance? I don't know whether sniffing all TCP traffic is too much overhead for this requirement.
Find history files for different browsers
This way looks like the easiest one, but I wonder if the solution is stable. I am not sure whether (and when) the browser reliably writes to its history. My application will pop up information before the user leaves the current page. The solution won't work for me if the browser only writes to the history file when the user closes it.
Use FindWindow, accessibility objects, or a COM interface to find the UI element which contains the URL
I find this approach incomplete; for example, Chrome will only expose the active tab's URL, not all of them.
Another drawback is that I have to request the URL a second time to get its HTML content.
Any comment or suggestion is welcome.
BTW, I am not writing any spyware. The application is trying to find all RSS feeds on a web page and show them to end users. I could easily do that in a browser plugin, but I really want to support multiple browsers with a single UI. Thanks.
Though this is a very old post, I thought I'd give some input.
Approach 1, WinPcap, is the best one. It will work for any browser, even the built-in browser of any other installed application, and it is relatively light on resources.
There is a library, Pcap.Net, that has an HTTP parser. You can reconstruct the HTTP stream and use its HttpResponseDatagram to parse the body, which your application can then consume.
This link gave me more insight:
Tcp Session Reconstruction with Winpcap
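A rough sketch of that approach with Pcap.Net (device selection, the port-80 filter, and the ASCII decoding are simplifications, and HTTPS traffic cannot be read this way):

```csharp
// Sketch: capture port-80 traffic with Pcap.Net and print the HTTP requests,
// from which visited URLs can be reconstructed.
using System;
using System.Linq;
using System.Text;
using PcapDotNet.Core;
using PcapDotNet.Packets;

class HttpSniffer
{
    static void Main()
    {
        // Naively take the first capture device; a real tool would let the user pick.
        var device = LivePacketDevice.AllLocalMachine.First();

        using var communicator =
            device.Open(65536, PacketDeviceOpenAttributes.Promiscuous, 1000);

        communicator.SetFilter("tcp port 80");
        communicator.ReceivePackets(0, HandlePacket);
    }

    static void HandlePacket(Packet packet)
    {
        var http = packet.Ethernet.IpV4.Tcp.Http;
        if (http == null || http.Length == 0 || !http.IsRequest)
            return;

        // Dump the request line and headers; Host + request URI give the URL.
        Console.WriteLine(Encoding.ASCII.GetString(http.ToArray()));
    }
}
```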

Guitar Tab API?

I was just curious if anyone has heard of any sort of API for guitar tabs? The thought passed through my mind that it would be really neat if I could grab guitar tabs from the internet to bring into my C# app, but I haven't been able to find anything.
Thanks,
Chris
You can use System.Net.WebRequest to read the repository of your choice.
You can also use System.ServiceModel.Syndication.SyndicationFeed to read RSS feeds.
EDIT: Try scraping http://www.mxtabs.net/guitar_tabs/ using WebRequest.
You can write code that sends requests to the pages of that (or any other) website and parses the responses to extract information.
However, you might want to get their permission first. They might even offer you an API.
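A minimal sketch combining the two suggestions; the URLs are placeholders:

```csharp
// Sketch: fetch a tab page with WebRequest and read an RSS feed with
// SyndicationFeed. Both URLs are placeholders.
using System;
using System.IO;
using System.Net;
using System.ServiceModel.Syndication;
using System.Xml;

class TabFetcher
{
    static void Main()
    {
        // Plain HTTP fetch of a tab page, ready for parsing.
        var request = WebRequest.Create("http://www.mxtabs.net/guitar_tabs/");
        using (var response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            string html = reader.ReadToEnd();
            Console.WriteLine($"Fetched {html.Length} characters of HTML.");
        }

        // Reading an RSS feed of new tabs, if the site offers one.
        using (var xml = XmlReader.Create("http://example.com/new-tabs.rss"))
        {
            var feed = SyndicationFeed.Load(xml);
            foreach (var item in feed.Items)
                Console.WriteLine($"{item.Title.Text} -> {item.Links[0].Uri}");
        }
    }
}
```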
You should approach it from another angle. Why isn't there a good data format for transferring guitar tab data? I mean, ASCII art is fine, but it is easily damaged, and it doesn't convey timing information well.
If you could come up with a format that could reach critical mass, that would be a good thing.
To scrape a web page, first figure out exactly which data you want to extract.
Visit the relevant pages with Fiddler running, and look at the HTTP requests and responses that you get.
You can then write C# code that requests the relevant page and reads through the response, line by line, looking for lines that you're interested in.
If the web page is XHTML compliant, you can also parse it using XDocument, but most web pages aren't.
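A minimal sketch of that line-by-line approach (the URL and the filter condition are placeholders):

```csharp
// Sketch: request a page and read the response line by line, keeping only
// the lines of interest. The URL and the "<pre" filter are placeholders.
using System;
using System.IO;
using System.Net;

class LineScraper
{
    static void Main()
    {
        var request = WebRequest.Create("http://example.com/tabs/some-song");
        using var response = request.GetResponse();
        using var reader = new StreamReader(response.GetResponseStream());

        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Contains("<pre"))
                Console.WriteLine(line);
        }

        // Only if the page is well-formed XHTML (most aren't):
        // var doc = System.Xml.Linq.XDocument.Load("http://example.com/tabs/some-song");
    }
}
```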
http://www.911tabs.com/ seems to have a large selection, and they appear to all be a variant of ASCII-art, so it would be relatively easy to write an HTML- or text-scraping routine. They don't appear to use a very standardized format, however, so this might be more work than I think.
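If the tabs do turn out to sit in <pre> blocks, a text-scraping routine with the Html Agility Pack might look roughly like this (the XPath selector and the URL path are guesses; check the actual markup first):

```csharp
// Sketch of a text-scraping routine with the Html Agility Pack, assuming the
// ASCII-art tabs sit inside <pre> elements.
using System;
using HtmlAgilityPack;

class TabScraper
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.911tabs.com/some-tab-page");

        var preBlocks = doc.DocumentNode.SelectNodes("//pre");
        if (preBlocks == null)
        {
            Console.WriteLine("No <pre> blocks found; the selector needs adjusting.");
            return;
        }

        foreach (var block in preBlocks)
            Console.WriteLine(HtmlEntity.DeEntitize(block.InnerText));
    }
}
```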
