Web Scraper design - C#

Can we make an application that searches Google for a word and navigates to the various result pages, using HttpWebResponse, and searches for the word on each rendered page? It should also use multiple proxies, i.e. all of the above is multi-threaded and each thread uses a different proxy.
So far I have failed to do so; GetResponse returns "Method not allowed".

You might want to look at the Google API, as that would be much easier to access, but I do not know whether your usage would require you to buy a service license.
Also, verify that you get the right response, as Google sometimes uses iframes for content; maybe you are only getting the outer frame?
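As a rough starting point, here is a minimal sketch (plain HttpWebRequest, one proxy per task) of what the question describes. The proxy addresses and target URL are placeholders, and a 405 "Method not allowed" response often comes down to using the wrong HTTP verb or omitting a User-Agent:

    // Minimal sketch: fetch a page through several proxies in parallel and
    // search the returned HTML for a word. Proxy addresses are placeholders.
    using System;
    using System.IO;
    using System.Net;
    using System.Threading.Tasks;

    class ProxyFetchExample
    {
        static void Main()
        {
            string[] proxies = { "http://10.0.0.1:8080", "http://10.0.0.2:8080" }; // placeholders

            Parallel.ForEach(proxies, proxyAddress =>
            {
                var request = (HttpWebRequest)WebRequest.Create("https://example.com/?q=word");
                request.Method = "GET";                     // 405 usually means the verb is wrong
                request.Proxy = new WebProxy(proxyAddress); // one proxy per task
                request.UserAgent = "Mozilla/5.0";          // some sites reject empty user agents

                using var response = (HttpWebResponse)request.GetResponse();
                using var reader = new StreamReader(response.GetResponseStream());
                string html = reader.ReadToEnd();
                Console.WriteLine($"{proxyAddress}: contains word = {html.Contains("word")}");
            });
        }
    }

Keep in mind that scraping Google's result pages directly is against their terms of service, which is another reason to prefer the API.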

Related

Scraping web pages (including AJAX) from a .NET solution

Has anyone had success making scraping software in an Azure Function? It needs to be performed with some kind of dynamic content loading, like the web browser control or Selenium, where all content is loaded before scraping starts. It seems like Selenium is not an option due to the nature of Azure Functions.
I am trying to scrape some web pages and extract content. The pages are pretty dynamic: first the HTML is loaded, and then data is lazy-loaded through JavaScript. Using a standard HTTP request I will not get the data. I could use the browser control in .NET and wait for the Ready state, but the browser control requires a browser and cannot be used in an Azure Function. HtmlAgilityPack could be the right answer; I tried it 5 years ago, and at that point it was pretty terrible at formatting HTML. I can see they have some kind of JavaScript library that could be worth a try. Have you tried using that part of HtmlAgilityPack?
Your question is purely .NET/C#-ish (at least I assume you use .NET C#).
Refer to this answer, please. If you achieve your goal in some way via .NET, you can do it in an Azure function - no restrictions on this side of the road.
For sure you will need an external third-party library that somehow simulates a web browser. I know that Selenium uses browser "drivers" (I am not sure of the details) - this could be an idea to research more thoroughly.
I was (and soon will be again) challenged with a similar request, and I found no obvious solution. My personal expectation is that a dedicated external service (or something similar) should be developed, which could then send the result to an Azure HTTP Trigger function that proceeds with the analysis. This so-called "service" could even have a Web API interface to be consumed from anywhere (e.g. an Azure Function).
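To make that architecture concrete, here is a minimal sketch of the receiving side: an HTTP-trigger function (in-process model) that accepts already-rendered HTML posted by the hypothetical external rendering service and parses it with HtmlAgilityPack. The function name and analysis step are illustrative assumptions:

    // Hypothetical Azure HTTP-trigger function: the external rendering
    // service POSTs fully rendered HTML, and this function analyzes it.
    using System.IO;
    using System.Threading.Tasks;
    using HtmlAgilityPack;
    using Microsoft.AspNetCore.Http;
    using Microsoft.AspNetCore.Mvc;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.Http;

    public static class AnalyzeRenderedHtml
    {
        [FunctionName("AnalyzeRenderedHtml")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
        {
            // The rendering service sends the final DOM, after lazy loading.
            string html = await new StreamReader(req.Body).ReadToEndAsync();

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Example analysis: return the page title.
            string title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText ?? "(none)";
            return new OkObjectResult(new { title });
        }
    }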

Capture failed loads of content using Selenium

When loading a web page, the browser executes many GET requests to fetch resources such as images, CSS files, fonts, and other assets.
Is there a possibility to capture failed GET requests using Selenium in C#?
Selenium does not natively provide this capability. I'm coming to this conclusion for two reasons:
I've not seen any function exported by Selenium's API which would allow doing what you want in a cross-platform way.
(I'm saying "cross-platform way" because I'm excluding from considerations possible non-standard APIs that could be exported by one browser but not others.)
If there is any doubt I may have missed something, then consider that ...
The Selenium team has quite consciously decided not to provide any means to get the response code of the HTTP request that downloads the page in the first place. It is extremely doubtful that they would have slipped behind the scenes a way to get the response code of the other HTTP requests that are launched to load other resources.
The way to check on any such requests is to have the browser launched by Selenium connect through a proxy that records the responses, or to load the page with something other than Selenium.
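A minimal sketch of the proxy approach, assuming a local recording proxy (e.g. Fiddler) is already listening on 127.0.0.1:8888; the failed GETs then show up in the proxy's capture rather than in Selenium:

    // Route a Selenium-driven Chrome through a local recording proxy.
    // The proxy address is an assumption; point it at whatever you run.
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;

    class ProxiedSeleniumExample
    {
        static void Main()
        {
            var proxy = new Proxy
            {
                Kind = ProxyKind.Manual,
                HttpProxy = "127.0.0.1:8888", // assumed local recording proxy
                SslProxy = "127.0.0.1:8888"
            };

            var options = new ChromeOptions { Proxy = proxy };

            using (IWebDriver driver = new ChromeDriver(options))
            {
                driver.Navigate().GoToUrl("https://example.com");
                // Failed resource GETs (4xx/5xx) are now visible in the
                // proxy's log, not through the Selenium API.
            }
        }
    }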

HTML Page Scraping

What's the best way to scrape a web page that has AJAX/dynamic loading of data?
For example: scraping a web page that presents 20 images on load, but loads more images as the user scrolls down the page (much like Facebook). In such a case, how do you scrape all the images, not just the first 20?
This is something that not even the major search engines have mastered yet. It's called "event-driven crawling".
Google even has a guide on what to do to help them crawl your AJAX sites better.
The best thing would be to read some open-source crawlers and see what they do. But your chances of crawling even 80% of such content are slim at best, unless you have a specific target in mind.
There are also some interesting reads at crawljax
Basically, you should try looking for scripts and checking whether they make any AJAX calls, then determine what kind of parameters they take and make repeated calls with incremented/decremented parameter values. This only works if the parameters have a logical pattern, such as being numbers, single letters, etc. It also depends on whether you're targeting a known site or just sending it into the wild. If you know your target, you can inspect its DOM and customize your code for greater accuracy, as mentioned by wolf.
Good luck
Use a tool such as Fiddler or Wireshark to inspect the web request that is made when loading more items.
Then replicate the request in your code.
Update (thanks to pguardiario for his comment):
Note that Wireshark is a low-level network capture tool that offers a great deal of detail about the traffic (packets being exchanged, DNS lookups, and so on), and may be painful to use in such a scenario, where you only wish to see the HTTP requests.
So you're better off using Fiddler, or a similar tool built into a browser (e.g. Chrome's Network panel in the developer tools).
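Once the request is identified, replicating it is usually a loop over a paging parameter. A minimal sketch with HttpClient; the endpoint and "offset" parameter are hypothetical, so substitute whatever Fiddler shows the page actually requesting:

    // Replay an inspected AJAX call with an incrementing paging parameter.
    // The URL and parameter names are assumptions for illustration.
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class AjaxReplayExample
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            // Many "infinite scroll" pages page their data by numeric offset.
            for (int offset = 0; offset < 100; offset += 20)
            {
                string url = $"https://example.com/api/images?offset={offset}&limit=20";
                string json = await client.GetStringAsync(url);
                Console.WriteLine($"offset {offset}: {json.Length} bytes");
                // Parse the JSON here (e.g. with System.Text.Json) and
                // collect the image URLs it contains.
            }
        }
    }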
Crawljax is open source and can dynamically crawl Ajax-based content.

How to scrape a flash based site?

We are using Html Agility Pack to scrape data from HTML-based sites; is there any DLL like Html Agility Pack for scraping Flash-based sites?
It really depends on the site you are trying to scrape. There are two types of sites in this regard:
If the site has the data inside the SWF file, then you'll have to decompile the SWF file and read the data inside. With enough work you can probably do it programmatically. However, if this is the case, it might be easier to just gather the data manually, since the data probably isn't going to change much.
In most cases, however, especially with sites that have a lot of data, the Flash file is actually contacting an external API. In that case you can simply ignore the Flash altogether and go to the API directly. If you're not sure, just activate Firebug's Net panel and start browsing. If the site is using an external API, it should become obvious.
Once you find that API, you could probably reverse engineer how to manipulate it to give you whatever data you need.
Also note that if it's a big enough site, there are probably non-flash ways to get to the same data:
It might have a mobile site (with no Flash) - try accessing the site with an iPhone user-agent.
It might have a site for crawlers (like Googlebot) - try accessing the site with a Googlebot user-agent.
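A minimal sketch of the user-agent trick, using a placeholder URL; if the site serves a crawler-friendly version, the returned HTML will differ from what a normal browser gets:

    // Request a page while presenting Googlebot's user-agent string, in case
    // the site serves a crawler-friendly (non-Flash) variant.
    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class UserAgentExample
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            client.DefaultRequestHeaders.UserAgent.ParseAdd(
                "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");

            string html = await client.GetStringAsync("https://example.com/"); // placeholder
            Console.WriteLine(html.Substring(0, Math.Min(200, html.Length)));
        }
    }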
EDIT:
If you're talking about crawling (getting data from any random site) rather than scraping (getting structured data from a specific site), then there's not much you can do; even Googlebot isn't scraping Flash content, mostly because, unlike HTML, Flash doesn't have a standardized syntax from which you can immediately tell what is text, what is a link, etc.
You won't have much luck with the Html Agility Pack. One method would be to use something like FiddlerCore to proxy HTTP requests to/from a Flash site. You would start the FiddlerCore proxy, then use something like the C# WebBrowser control to go to the URL you want to scrape. As the page loads, all those HTTP requests get proxied and you can inspect their contents. However, you wouldn't get most of the text, since that's often static within the Flash. Instead, you'd mostly get larger content (videos, audio, and maybe images) that is usually stored separately. This will be slower compared to more traditional scraping/crawling because you'll actually have to execute/run the page in the browser.
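A minimal sketch of that FiddlerCore setup; the port is arbitrary and the event/flag names are from the FiddlerCore API, though exact signatures may vary between versions:

    // Start a FiddlerCore proxy, log every completed HTTP session, then
    // browse the Flash site; media and API calls show up in the log.
    using System;
    using Fiddler;

    class FlashSniffExample
    {
        static void Main()
        {
            FiddlerApplication.AfterSessionComplete += session =>
            {
                // One proxied request/response pair per session.
                Console.WriteLine($"{session.responseCode} {session.fullUrl}");
            };

            FiddlerApplication.Startup(8877, FiddlerCoreStartupFlags.RegisterAsSystemProxy);

            Console.WriteLine("Proxy running; browse to the Flash site, then press Enter.");
            Console.ReadLine();

            FiddlerApplication.Shutdown();
        }
    }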
If you're familiar with all of those YouTube Downloader types of extensions, they work on this same principle, except that they intercept HTTP requests directly from Firefox (for example) rather than going through a separate proxy.
I believe that Google and some of the big search engines have a special arrangement with Adobe/Flash and are provided with some software that lets their search engine crawlers see more of the text and things that Google relies on. Same goes for PDF content. I don't know if any of this software is publicly available.
Scraping Flash content would be quite involved, and the reliability of any component that claims to do so is questionable at best. However, if you wish to "crawl" or follow hyperlinks in a Flash animation on some web page, you might have some luck with Infant. Infant is a free Java library for web crawling, and offers limited / best-effort Flash content hyperlink following abilities. Infant is not open source, but is free for personal and commercial use. No registration required!
How about capturing the whole page as an image and running OCR on it to read the data?

Search several web sites using C#

Can I use C# to auto search websites, then return the search results?
Is there a web crawler that would do the same thing if I give it a top domain (e.g. I tell it to find the word "funny" on stackoverflow.com, and it would tell me all the times "funny" appeared)?
These web sites allow searching via their search bar.
Do I need the web sites cooperation to automate searches?
NOTE: I only plan to be doing about one or two searches a day, so I doubt I'll be blocked, or asked to authenticate myself.
If you're planning on crawling through an entire website to count words like that, you will get blocked unless you cache the pages; you'll essentially be requesting every page of the website. Perhaps consider integrating Google domain search instead?
Here is a link to Google's page detailing how to interface with C#:
http://code.google.com/apis/gdata/client-cs.html
EDIT: Sorry, that wasn't quite right: http://gsalib.codeplex.com/
http://answers.oreilly.com/topic/2165-how-to-search-google-and-bing-in-c/
I would look into building an RSS aggregator. RSS is standardized, so that's probably the most reliable way to collect search results from various sources.
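Where a site does expose its search results as RSS, reading them from C# is short; a minimal sketch using the built-in System.ServiceModel.Syndication types, with a hypothetical feed URL:

    // Read a search-results RSS feed. The feed URL is an assumption; many
    // sites expose results as RSS via a query-string parameter.
    using System;
    using System.ServiceModel.Syndication;
    using System.Xml;

    class RssSearchExample
    {
        static void Main()
        {
            string feedUrl = "https://example.com/search?q=funny&format=rss"; // assumed
            using var reader = XmlReader.Create(feedUrl);
            SyndicationFeed feed = SyndicationFeed.Load(reader);

            foreach (SyndicationItem item in feed.Items)
            {
                string link = item.Links.Count > 0 ? item.Links[0].Uri.ToString() : "";
                Console.WriteLine($"{item.Title.Text} -> {link}");
            }
        }
    }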
EDIT: For sites that don't support RSS
For the sites that don't support RSS, you can look into using a screen scraper. Check out this article on The Code Project to get you started:
http://www.codeproject.com/KB/aspnet/weather.aspx
...web sites allow searching via their search bar ... Can I use C# to auto search websites, then return the search results?
Yes, if the website provides a URL where the search-term is provided as a query-string argument to a URL.
http://yourTargetDomain?searchterm=foo
But unless the website has specifically designed the search results from that URL to be structured data, the website won't "tell [you] all the times 'funny' appeared"; it will send back a search response suitable for a browser to display, so you would have to parse the results out of that stream of HTML.
For example:
http://philadelphia.craigslist.org/search/tls?query=ladder&srchType=A&minAsk=&maxAsk=
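Pulling this together, a minimal sketch that requests such a search URL and counts occurrences of the term in the returned HTML with HtmlAgilityPack; the URL pattern is the hypothetical one from above:

    // Fetch a search-results page and count a word in its visible text.
    // The URL pattern is the hypothetical one from the answer above.
    using System;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;
    using HtmlAgilityPack;

    class SearchScrapeExample
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            string html = await client.GetStringAsync(
                "http://yourTargetDomain?searchterm=foo"); // assumed URL pattern

            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            // Count the term in the visible text rather than the raw markup.
            string text = doc.DocumentNode.InnerText;
            int hits = Regex.Matches(text, @"\bfunny\b", RegexOptions.IgnoreCase).Count;
            Console.WriteLine($"'funny' appeared {hits} times.");
        }
    }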
