Here's the scenario: we've developed around 400 personal sites and are currently trying to build our portfolio. For several reasons we would like to display each site's index page in that portfolio. Our first thought was to take screenshots of every site programmatically. The heads of our company promptly shot that down because they want to show the sites live, and iframes are apparently not an alternative either. So we have to download each index page, ideally with only the styles and images needed to display it properly.
I am unsure on how to start doing this.
Do you guys have any ideas?
CodedUI (and Selenium) drive a live browser and can inspect the page's DOM, which lets you isolate specific, useful parts of a web page. I recommend using that library to crawl your web pages while they run live and extract whatever images and divs make up your page structure.
You can then emit these as static HTML to make page snapshots suitable for a site index.
Doing it this way means you will be using the same technology you use for test automation, but instead of running tests you extract the useful structure from your HTML and emit it as a page snapshot. You will have to mark the "useful" parts of your HTML so the crawler can extract just the items you think should be indexed (e.g. add a data-* attribute in HTML5). This might be a lot of work - so if you just need a screenshot of each of your pages, simply use Selenium or CodedUI to crawl your sites and capture the screen image.
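For the screenshot (or raw DOM) route, a minimal Selenium sketch in C# might look like the following; it assumes the Selenium WebDriver and ChromeDriver NuGet packages, and the URL and file names are placeholders:

    using System.IO;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;

    class SnapshotTool
    {
        static void Main()
        {
            using (IWebDriver driver = new ChromeDriver())
            {
                driver.Navigate().GoToUrl("http://example.com"); // one of the ~400 sites

                // The DOM as rendered after scripts ran - raw material for a static snapshot.
                File.WriteAllText("example-index.html", driver.PageSource);

                // Or capture the rendered page as an image instead.
                Screenshot shot = ((ITakesScreenshot)driver).GetScreenshot();
                File.WriteAllBytes("example-index.png", shot.AsByteArray);
            }
        }
    }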
I'm trying to scrape a particular webpage which works as follows.
First the page loads, then it runs some sort of javascript to fetch the data it needs to populate the page. I'm interested in that data.
If I get the page with HtmlAgilityPack, the script doesn't run, so I get what is essentially a mostly blank page.
Is there a way to force it to run a script, so I can get the data?
You are getting what the server is returning - the same as a web browser. A web browser, of course, then runs the scripts. Html Agility Pack is an HTML parser only - it has no way to interpret the javascript or bind it to its internal representation of the document. If you wanted to run the script you would need a web browser. The perfect answer to your problem would be a complete "headless" web browser. That is something that incorporates an HTML parser, a javascript interpreter, and a model that simulates the browser DOM, all working together. Basically, that's a web browser, except without the rendering part of it. At this time there isn't such a thing that works entirely within the .NET environment.
Your best bet is to use a WebBrowser control and actually load and run the page in Internet Explorer under programmatic control. This won't be fast or pretty, but it will do what you need to do.
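A rough sketch of that WebBrowser approach (assuming a project referencing System.Windows.Forms; the URL is a placeholder, and a real page may need an extra delay or polling loop so its AJAX calls can finish before you read the DOM):

    using System;
    using System.Windows.Forms;

    class RenderedPageFetcher
    {
        [STAThread]  // the WebBrowser control requires an STA thread
        static void Main()
        {
            string renderedHtml = null;
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };

            browser.DocumentCompleted += (sender, e) =>
            {
                // The DOM here is what exists after the page's scripts have run
                // (AJAX-heavy pages may still need an additional wait in practice).
                renderedHtml = browser.Document.Body.OuterHtml;
                Application.ExitThread();   // stop the message loop below
            };

            browser.Navigate("http://example.com/page-with-script"); // placeholder URL
            Application.Run();              // pump messages so navigation and scripts execute

            Console.WriteLine(renderedHtml);
        }
    }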
Also see my answer to a similar question: Load a DOM and Execute javascript, server side, with .Net which discusses the available technology in .NET to do this. Most of the pieces exist right now but just aren't quite there yet or haven't been integrated in the right way, unfortunately.
You can use Awesomium for this: http://www.awesomium.com/. It works fairly well but has no support for x64 and is not thread safe. I'm using it to scan some web sites 24x7; it runs fine for a couple of days in a row, but then it usually crashes.
I am currently optimizing my site for search engines. It is mainly a database-driven site. I am using C# on the back end, but the database content is loaded via jQuery AJAX and a web service, so my database content is not in the HTML at the point the bots crawl it. My site follows an online-supermarket format: there are thousands of items in my database, users can load one or more of them onto the web page at a time, and the page does not change significantly once items are loaded.
My question is, how (if at all) can I get my database contents indexed? I was thinking of having an anchor that links to an .aspx page (e.g. called mydatabase) which loads all of my database items as a big HTML list. Then, using jQuery, I would make the anchor invisible to users. The data would still be accessible to users, just not via this link; it would be accessed through the jQuery interface I have created.
The thing is, I don't really want users to see this big, messy list. Would Google show this page (e.g. www.mysite.com/mydatabase.aspx) as a search result? And would Google see it as a keyword-rich spam page? I have done quite a lot of research but found nothing on this, only instructions for PHP. I'm not sure what to do and need to know the best way to go about this.
It's a shame you haven't taken the progressive enhancement approach, as you would have started with standard HTML output that's crawlable and then layered the behaviour (AJAX) on top for the user experience.
Providing a single file (e.g. mydatabase.aspx) that lists all of your products offers no real value, for the reason you gave - it would just be a big, useless list, with no editorial relevance for each link.
You're much better off taking another look at your information architecture and trying to ensure that each product is accessible by its own unique URL, then classifying the products into groups (result pages), being careful to think about pagination.
You can still make this act like a single-page application using AJAX, but you'd want to look into HTML5's History API to achieve this in a search engine friendly way.
We are using the Html Agility Pack to scrape data from HTML-based sites; is there any DLL like the Html Agility Pack for scraping Flash-based sites?
It really depends on the site you are trying to scrape. There are two types of sites in this regard:
If the site has the data inside the SWF file, then you'll have to decompile the SWF file and read the data inside. With enough work you can probably do it programmatically. However, if this is the case, it might be easier to just gather the data manually, since it probably isn't going to change much.
In most cases, however, especially with sites that have a lot of data, the Flash file is actually contacting an external API. In that case you can simply ignore the Flash altogether and go to the API directly. If you're not sure, just activate Firebug's net panel and start browsing; if the site is using an external API, it should become obvious.
Once you find that API, you could probably reverse engineer how to manipulate it to give you whatever data you need.
Also note that if it's a big enough site, there are probably non-Flash ways to get to the same data (see the sketch after this list):
It might have a mobile site (with no Flash) - try accessing the site with an iPhone user-agent.
It might have a site for crawlers (like Googlebot) - try accessing the site with a Googlebot user-agent.
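Combining the "hit the API directly" idea with a spoofed user-agent takes only a few lines; a hypothetical sketch, where the endpoint and user-agent string are placeholders rather than anything from a real site:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class FlashApiProbe
    {
        static async Task Main()
        {
            using (var client = new HttpClient())
            {
                // Pretend to be a mobile browser (or Googlebot) to coax out a non-Flash response.
                client.DefaultRequestHeaders.TryAddWithoutValidation(
                    "User-Agent", "Mozilla/5.0 (iPhone; CPU iPhone OS 9_0 like Mac OS X)");

                // This endpoint stands in for whatever URL Firebug's net panel reveals.
                string payload = await client.GetStringAsync("http://example.com/api/items?format=json");
                Console.WriteLine(payload);
            }
        }
    }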
EDIT:
If you're talking about crawling (getting data from any random site) rather than scraping (getting structured data from a specific site), then there's not much you can do; even Googlebot isn't scraping Flash content. That's mostly because, unlike HTML, Flash doesn't have a standardized syntax from which you can immediately tell what is text, what is a link, etc.
You won't have much luck with the HTML Agility Pack. One method would be to use something like FiddlerCore to proxy HTTP requests to/from a Flash site. You would start the FiddlerCore proxy, then use something like the C# WebBrowser control to go to the URL you want to scrape. As the page loads, all those HTTP requests get proxied and you can inspect their contents. However, you wouldn't get most of the text, since that's often static within the Flash. Instead, you'd get mostly larger content (videos, audio, and maybe images) that is usually stored separately. This will be slow compared to more traditional scraping/crawling because you'll actually have to execute/run the page in the browser.
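A rough sketch of that idea, based on the classic FiddlerCore sample API (Fiddler.FiddlerApplication); treat the exact method and flag names as assumptions to verify against the version you install, and the port number is arbitrary:

    using System;
    using Fiddler;

    class FlashTrafficSniffer
    {
        static void Main()
        {
            // Log every completed request/response made while the Flash page runs in a browser.
            FiddlerApplication.AfterSessionComplete += session =>
            {
                Console.WriteLine(session.fullUrl);
                // session.GetResponseBodyAsString() would expose the payload (e.g. XML/JSON the SWF fetched).
            };

            // The default flags register FiddlerCore as the system proxy, so a WebBrowser
            // control (or IE) navigating to the Flash site is captured here.
            FiddlerApplication.Startup(8877, FiddlerCoreStartupFlags.Default);

            Console.WriteLine("Browse the Flash site now; press Enter to stop capturing.");
            Console.ReadLine();

            FiddlerApplication.Shutdown();
        }
    }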
If you're familiar with all of those YouTube Downloader types of extensions, they work on this same principle, except that they intercept HTTP requests directly from Firefox (for example) rather than through a separate proxy.
I believe that Google and some of the big search engines have a special arrangement with Adobe/Flash and are provided with some software that lets their search engine crawlers see more of the text and things that Google relies on. Same goes for PDF content. I don't know if any of this software is publicly available.
Scraping Flash content would be quite involved, and the reliability of any component that claims to do so is questionable at best. However, if you wish to "crawl" or follow hyperlinks in a Flash animation on some web page, you might have some luck with Infant. Infant is a free Java library for web crawling, and offers limited / best-effort Flash content hyperlink following abilities. Infant is not open source, but is free for personal and commercial use. No registration required!
How about capturing the whole page as an image and running OCR on it to read the data?
I'm starting the pseudo-code for a new site and want it to be as SEO-friendly as possible.
The site I am creating is a booking agency site built with C# and ASP.NET. Essentially, bands will register on the site with their availability and other info, and fill out their profile information with images etc. This info will be stored in a database.
Creating this is not a problem, but I want the site to be as SEO-friendly as possible.
I know Google loves huge sites with great content, and all of these profile pages would be an excellent addition to my site for SEO purposes. I also hear that Google cannot see dynamically generated content when crawling a site.
I want to find a method of coding these pages so that Google can see the content when it crawls them.
I need a pointer in the right direction for a solution. Nothing is off limits - I will basically code my entire site around this principle; I just have no idea where to start looking. I'm not looking for a code solution, just what I should be researching to solve this issue.
Thanks in advance
I also hear that Google cannot see dynamically generated content when crawling a site.
Google can see anything you can retrieve via an HTTP GET request (i.e. there's a specific URL for it) that someone has either linked to or listed in a published XML sitemap file.
To make sure your profile pages fit this, you will want all profiles rendered via a single ASP.NET *.aspx page that determines which profile is shown via a URL parameter. Something that looks like this:
http://example.com/profiles.aspx?profile=SomeBandName
Now, you probably also want a friendly URL that looks like this:
http://example.com/profiles/SomeBandName
To do that, you need to set up routing.
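A minimal sketch of that routing setup for ASP.NET Web Forms, assuming .NET 4's System.Web.Routing and a Global.asax; the page, route, and parameter names are only illustrative:

    using System;
    using System.Web;
    using System.Web.Routing;

    public class Global : HttpApplication
    {
        protected void Application_Start(object sender, EventArgs e)
        {
            // Maps /profiles/SomeBandName onto Profiles.aspx; inside the page, read the
            // band name via Page.RouteData.Values["profile"].
            RouteTable.Routes.MapPageRoute(
                "BandProfile",          // route name
                "profiles/{profile}",   // friendly URL pattern
                "~/Profiles.aspx");     // the single physical page that renders every profile
        }
    }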
For Google or other search engines to crawl and index your pages properly, follow these guidelines:
i: The page title must be precise and match the content available on the page.
ii: The page URL should be user-friendly.
iii: Content is king (useful content).
iv: Do not rely on AJAX or JavaScript to load content.
v: Avoid Flash and other media files; where they exist, give them a description via an alt tag.
vi: Create a URL sitemap of all static and dynamically generated content (see the sketch after this list).
vii: Submit the sitemap to Google and keep tracking how Google crawls and indexes your pages, fixing any issues it finds as you go.
This way most of your pages and content will be indexed properly and quickly.
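As a hedged sketch of points vi and vii, you could generate a sitemap.xml entry for each dynamically generated page and then submit the file through Google Webmaster Tools; the band names, base URL, and output path below are placeholders:

    using System;
    using System.Linq;
    using System.Xml.Linq;

    class SitemapBuilder
    {
        static void Main()
        {
            XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

            // In the real site these names would come from the profiles in the database.
            var bandNames = new[] { "SomeBandName", "AnotherBand" };

            var sitemap = new XDocument(
                new XElement(ns + "urlset",
                    bandNames.Select(name =>
                        new XElement(ns + "url",
                            new XElement(ns + "loc",
                                "http://example.com/profiles/" + Uri.EscapeDataString(name)),
                            new XElement(ns + "changefreq", "weekly")))));

            sitemap.Save("sitemap.xml"); // then submit this file in Google Webmaster Tools
        }
    }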
I'd look into dynamic URL Rewriting.
Basically, instead of having one page, say http://localhost/Profile.aspx, you'll have a bunch of simulated URLs like:
http://localhost/profiles/Band1
http://localhost/profiles/Band2
http://localhost/profiles/Band3
etc.
All of those will then map back to the original Profile.aspx page with a parameter, so internally in your code it would look like http://localhost/Profile.aspx?Name=Band1, http://localhost/Profile.aspx?Name=Band2, etc.
Basically, your website appears to have a separate page for each band, but in reality they all get mapped back to the same ASP.NET page with different parameters.
Here is an article I read about it some time back: http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
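A minimal sketch of that mapping as an IHttpModule; the /profiles/ prefix, page name, and Name parameter follow the example URLs above, and module registration in web.config plus edge-case handling are left out:

    using System;
    using System.Web;

    // Rewrites /profiles/BandName to /Profile.aspx?Name=BandName before the page executes.
    public class ProfileRewriteModule : IHttpModule
    {
        public void Init(HttpApplication app)
        {
            app.BeginRequest += (sender, e) =>
            {
                HttpContext ctx = ((HttpApplication)sender).Context;
                string path = ctx.Request.Path;   // e.g. "/profiles/Band1"

                if (path.StartsWith("/profiles/", StringComparison.OrdinalIgnoreCase))
                {
                    string band = path.Substring("/profiles/".Length);
                    ctx.RewritePath("~/Profile.aspx?Name=" + HttpUtility.UrlEncode(band));
                }
            };
        }

        public void Dispose() { }
    }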
I also hear that Google cannot see dynamically generated content when crawling a site.
You could create a sitemap.xml with the URLs pointing to the dynamic profile pages. Using Google Webmaster Tools, you can submit it and monitor the crawling progress.
You may also create an index page or something similar ('browse by category' pages) that links to matching profile pages.
A reference for SEO I regularly use is http://www.seomoz.org/learn-seo
One of my friends is working on a good solution for generating ASPX pages out of the HTML pages produced by a legacy ASP application.
The idea is to run the legacy app, capture the HTML output, clean the HTML using some tool (say HTML Tidy), and parse/transform it to ASPX (using XSLT or a custom tool) so that the existing HTML elements, divs, images, styles etc. get converted neatly to an ASPX page (too much? ;)).
Any existing tools/scripts/utilities to do the same?
Here's what you do.
Define what the legacy app is supposed to do. Write down the scenarios of getting pages, posting forms, navigating, etc.
Write unit test-like scripts for the various scenarios.
Use the Python HTTP client library to exercise the legacy app in your various scripts.
If your scripts work, you (a) actually understand the legacy app, (b) can make it do the various things it's supposed to do, and (c) you can reliably capture the HTML response pages.
Update your scripts to capture the HTML responses.
You have the pages. Now you can think about what you need for your ASPX pages.
Edit the HTML by hand to make it into ASPX.
Write something that uses Beautiful Soup to massage the HTML into a form suitable for ASPX. This might be some replacement of text or tags with <asp:... tags.
Create some other, more useful data structure out of the HTML -- one that reflects the structure and meaning of the pages, not just the HTML tags. Generate the ASPX pages from that more useful structure.
Just found the Html Agility Pack to be useful enough, as they understand C# better than Python.
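For what it's worth, here is a small Html Agility Pack sketch of the "massage the captured HTML" step; it only unwraps deprecated <font> tags as an example of the kind of cleanup you might do before converting to ASPX, and the file names are placeholders:

    using HtmlAgilityPack;

    class LegacyHtmlCleaner
    {
        static void Main()
        {
            var doc = new HtmlDocument();
            doc.Load("captured-legacy-page.html"); // output captured from the legacy ASP app

            // Unwrap deprecated <font> tags, keeping their contents in place.
            var fontTags = doc.DocumentNode.SelectNodes("//font");
            if (fontTags != null)
            {
                foreach (var font in fontTags)
                {
                    font.ParentNode.RemoveChild(font, true); // true = keep the child nodes
                }
            }

            doc.Save("cleaned-page.aspx"); // hand-edit or transform further from here
        }
    }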
I know this is an old question, but in a similar situation (50k+ legacy ASP pages that need to display in a .NET framework), I did the following.
Created a rewrite engine (HttpModule) which catches all incoming requests and looks for anything that is from the old site.
(In a separate class - keep things organized!) Use WebClient or HttpWebRequest, etc., to open a connection to the old server and download the rendered HTML.
Use the Html Agility Pack (very slick) to extract the content that I'm interested in - in our case, this is always inside of a div with the class "bdy" (see the sketch after these steps).
Throw this into a cache - a SQL table in this example.
Each hit checks the cache and either a) retrieves the page and builds the cache entry, or b) just gets the page from the cache.
An ASPX page built specifically for displaying legacy content receives the rewritten request and displays the relevant content from the legacy page inside an asp:Literal control.
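A compressed sketch of steps 2 and 3 above, assuming the Html Agility Pack package; the legacy URL and the fallback behaviour are only illustrative:

    using System.Net;
    using HtmlAgilityPack;

    class LegacyPageFetcher
    {
        public static string GetLegacyContent(string legacyUrl)
        {
            using (var client = new WebClient())
            {
                // e.g. legacyUrl = "http://old-server/somepage.asp"
                string html = client.DownloadString(legacyUrl);

                var doc = new HtmlDocument();
                doc.LoadHtml(html);

                // Grab only the content region we care about.
                var body = doc.DocumentNode.SelectSingleNode("//div[@class='bdy']");
                return body != null ? body.InnerHtml : html; // fall back to the whole page
            }
        }
    }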
The cache is there for performance - since the first request for a given page involves a minimum of two hits (one from the browser to the new server, one from the new server to the old server), I store cacheable data on the new server so that subsequent requests don't have to go back to the old server. We also cache images, CSS, scripts, etc.
It gets messy when you have to handle forms, cookies, etc, but these can all be stored in your cache and passed through to the old server with each request if necessary. I also store content expiration dates and other headers that I get back from the legacy server and am sure to pass those back to the browser when rendering the cached page. Just remember to take as content-agnostic an approach as possible. You're effectively building an in-page web proxy that lets IIS render old ASP the way it wants, and manipulating the output.
Works very well - I have all of the old pages working seamlessly within our ASP.NET app. This saved us a solid year of development time that would have been required if we had to touch every legacy asp page.
Good luck!