Our client wants to display a thumbnail screenshot of each page when it is listed in the website's search results. Of course, they want it to be automated. The website is built on Sitecore 9.0 and uses Solr for indexing. It seems that creating a computed index field would be the best option performance-wise, but I feel like it will still take forever when running a full index rebuild, because it makes an HTTP request for every page.
I looked at some solutions for capturing thumbnails, and this one looks the most promising: http://html2canvas.hertzen.com/. However, it doesn't seem like it will work with server-side C# HTTP requests. I'm also not sure how I'd selectively toggle the html2canvas event on pages, or how the page would send the image back as the response to the HTTP request.
Any other solution ideas would be appreciated.
Following scenario: We've developed around 400 personal sites and we are currently trying to build our portfolio. For several reasons we would like to display the index of each site in our portfolio. Our first thought was to take screenshots of every site programmatically. The heads of our company promptly rejected that idea because they want to show the sites live. Iframes are apparently not an alternative. So we have to download the index, ideally with only the styles and images needed to display it properly.
I am unsure on how to start doing this.
Do you guys have any ideas?
The underlying technology of CodedUI (and Selenium) uses a web crawler to isolate specific useful parts of a web page. I recommend using that underlying library to crawl your webpages running live, and extract whatever images and divs make up your page structure.
You can then emit these as static HTML to make page snapshots suitable for a site index.
Doing it this way means you will be using the same technology as you use for test automation, but instead of running tests, you extract the useful structure from your HTML and emit it as a page snapshot. You will have to mark the "useful" parts of your HTML so that the crawler can extract just the items you think should be indexed (e.g. add a data-* attribute if you are using HTML5). This might be a lot of work, so if you just need a screenshot of each of your pages, use Selenium or CodedUI to crawl your sites and capture the screen image.
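To make the "mark the useful parts" idea concrete, here is a minimal sketch in Python using only the standard library. It is not what CodedUI or Selenium do internally; it only illustrates pulling out the elements you have tagged. The attribute name data-snapshot and the class SnapshotExtractor are made up for this example.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SnapshotExtractor(HTMLParser):
    """Collects the markup of every element carrying a data-snapshot attribute.

    "data-snapshot" is a hypothetical marker name chosen for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.fragments = []  # finished fragments, one per marked element
        self._buffer = []    # pieces of the fragment currently being captured
        self._depth = 0      # nesting depth inside the marked element

    def _finish(self):
        self.fragments.append("".join(self._buffer))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth == 0 and "data-snapshot" not in dict(attrs):
            return  # not inside a marked element, and this one is not marked
        self._buffer.append(self.get_starttag_text())
        if tag in VOID_TAGS:
            if self._depth == 0:
                self._finish()  # a marked void element is a whole fragment
        else:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth == 0 or tag in VOID_TAGS:
            return
        self._buffer.append(f"</{tag}>")
        self._depth -= 1
        if self._depth == 0:
            self._finish()

    def handle_data(self, data):
        if self._depth:
            # Text arrives unescaped, so escape it again before re-emitting.
            self._buffer.append(escape(data, quote=False))


def extract_snapshots(page_html):
    """Returns the marked fragments of a page as a list of HTML strings."""
    extractor = SnapshotExtractor()
    extractor.feed(page_html)
    extractor.close()
    return extractor.fragments
```

The returned fragments can then be written out as static HTML. A real crawler would also need to carry over the stylesheets and images those fragments depend on.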
I am currently optimizing my site for search engines. It is mainly a database driven site. I am using C# on the back end but database content is loaded via jQuery ajax and a web service. Therefore, my database content is not in html at the point that the bots will crawl it. My site is kind of like an online supermarket format in that there are thousands of items in my database, users can load a single one of these or more onto the web page at a time and the page does not change significantly once items are loaded.
My question is, how (if at all) can I get my database contents indexed? I was thinking of having an anchor that links to an aspx page (eg called mydatabase) which loads all of my database items as a big html list. Then, using jQuery, I would make the anchor invisible to users. The data would still be accessible to users but not by this link, it would be accessed by using the jQuery interface I have created.
The thing is, I don't really want users to see this big, messy list. Would Google results show this page (e.g. www.mysite.com/mydatabase.aspx) as a search result? Also, would Google see this as a "keyword rich" spam page? I have done quite a lot of research but found nothing on this, only instructions for PHP. Please help. I'm not sure what to do and need to know the best way to go about this.
It's a shame you haven't taken the progressive enhancement approach, as it would have meant starting with a standard HTML output that's crawlable and then layering the behaviour (AJAX) on top for the user experience.
Providing a single file (e.g. mydatabase.aspx) that lists all of your products in a list format provides no real value for the reason you gave - it would just be a big useless list. No editorial content relevance for each link etc.
You're much better off taking another look at your information architecture and trying to ensure that each product is accessible by its own unique URL, then classifying the products into groups (result pages), being careful to think about pagination.
You can still make this act like a single-page application using AJAX, but you'd want to look into HTML5's History API to achieve this in a search engine friendly way.
What's the best way to scrape a web page that has AJAX/dynamic loading of data?
For example: scraping a webpage that presents 20 images on load, but when a user scroll down the page it loads more images (sort of like Facebook). In such a case how do you scrape all the images, not just the first 20?
This is something that not even the major search engines have mastered yet. It's called "event-driven crawling".
Google even has a guide on what to do to help them crawl your AJAX sites better.
Best thing would be to read some open source crawlers and see what they do. But your chances of crawling even 80% are slim at best, unless you have a specific target in mind.
There are also some interesting reads at crawljax.
Basically, you should try looking for scripts and checking whether they make any AJAX calls, then determine what kind of parameters they take and make repeat calls with incremented/decremented parameter values. This only works if the parameters follow a logical pattern, such as being numbers, single letters etc. It also depends on whether you're targeting a known site or just sending the crawler into the wild. If you know your target you can inspect its DOM and customize your code for greater accuracy, as mentioned by wolf.
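A rough Python sketch of the "repeat the call with an incremented parameter" idea follows. It assumes the endpoint takes a numeric page parameter and returns an empty batch once the data runs out; both are assumptions about the target site. The fetch_items callable is a placeholder you would supply yourself (it does the actual HTTP call and parsing).

```python
from typing import Callable, List
from urllib.parse import urlencode


def crawl_all(endpoint: str,
              fetch_items: Callable[[str], List[str]],
              page_param: str = "page",
              max_pages: int = 50) -> List[str]:
    """Calls the endpoint with page=1, 2, 3, ... until a batch comes back empty.

    fetch_items is a caller-supplied (hypothetical) function that takes a URL
    and returns the items found in that response.
    """
    collected: List[str] = []
    for page in range(1, max_pages + 1):  # hard cap so a bad guess can't loop forever
        url = f"{endpoint}?{urlencode({page_param: page})}"
        batch = fetch_items(url)
        if not batch:
            break  # nothing more to load
        collected.extend(batch)
    return collected
```

For the "20 images, then more on scroll" case, fetch_items would return the image URLs found in each response, and crawl_all would keep asking for the next page until the site has nothing more to give.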
Good luck
Use a tool such as Fiddler or WireShark to inspect the web request that is done when loading more items.
Then replicate the request in your code.
Update (thanks to pguardiario for his comment):
Note that Wireshark is a low-level network capture tool that offers a great deal of detail about the traffic (packets being exchanged, DNS lookups, and so on), and may be painful to use in a scenario like this, where you only wish to see the HTTP requests.
So, you're better off using Fiddler, or a similar tool in a browser (e.g. Chrome's Network inspection panel).
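Once you have seen the request in Fiddler or the browser's network panel, replicating it could look roughly like this in Python. The header set is only an example of what such a capture typically shows; the URL and form fields in the usage note are invented, so copy the real ones from your own capture.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_load_more_request(url, form_fields):
    """Builds a POST that mimics the browser's "load more" AJAX call.

    The headers below are assumptions; use whatever your capture shows.
    """
    body = urlencode(form_fields).encode("utf-8")
    return Request(
        url,
        data=body,  # supplying a body makes urllib send a POST
        headers={
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "X-Requested-With": "XMLHttpRequest",  # sent by many AJAX libraries
            "Accept": "application/json",
        },
    )


def send(request, timeout=10.0):
    """Performs the request and returns the raw response body as bytes."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()
```

For example, send(build_load_more_request("https://example.com/items/more", {"offset": 20, "count": 20})) would ask for the next 20 items, assuming the site uses offset and count fields like that.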
Crawljax is open source and can dynamically crawl Ajax-based content.
I'm starting the pseudocode of a new site, and want it to be as SEO friendly as possible.
The site I am creating is a booking agency site built with C# and ASP.NET. Essentially, bands will register on the site with their availability and other info, and fill out their profile information with images etc. This info will be stored in a database.
Creating this is not a problem, but I want the site to be as SEO friendly as possible.
I know Google loves huge sites with great content, and all of these profile pages would be an excellent addition to my site for SEO purposes. I also hear that Google cannot see dynamically generated content when crawling a site.
I want to find a method of coding these pages so that Google can see the content when it crawls them.
I need a pointer in the right direction for a solution. Nothing is off limits. I will basically code my entire site around this principle; I just have no idea where to start looking. I'm not looking for a code solution, just what I should be researching to solve this issue.
Thanks in advance
i also hear that google cannot see dynamically generated content when crawling a site.
Google can see anything you can retrieve via an HTTP GET request (i.e. there's a specific URL for it), provided that someone has either linked to it or it is listed in a published XML sitemap file.
To make sure that your profile pages fit this, you will want all profiles to be rendered via a single ASP.NET *.aspx file that determines which profile is shown via a URL parameter. Something that looks like this:
http://example.com/profiles.aspx?profile=SomeBandName
Now, you probably also want a friendly URL, that looks like this:
http://example.com/profiles/SomeBandName
To do that, you need to set up routing.
For Google or another search engine to crawl and index your pages properly, follow these guidelines.
i: The page title must be precise and match the content available on the page.
ii: The page URL should be user friendly.
iii: Content is king (useful content).
iv: Don't rely on AJAX or JavaScript to load content.
v: Avoid Flash and other media files; if they exist, they must have a description via an alt attribute.
vi: Create a URL sitemap of all static and dynamically generated content.
vii: Submit the sitemap to Google and keep tracking how Google crawls and indexes your pages.
Continuously fix any issues Google finds while crawling.
This way most of your pages and content will be indexed properly and quickly.
I'd look into dynamic URL Rewriting.
Basically instead of having one page say http://localhost/Profile.aspx you'll have a bunch of simulated urls like
http://localhost/profiles/Band1
http://localhost/profiles/Band2
http://localhost/profiles/Band3
etc.
All of those will then map back to the original Profile.aspx page with a parameter, so internally in your code it would look like http://localhost/Profile.aspx?Name=Band1, http://localhost/Profile.aspx?Name=Band2, etc.
Basically your website appears to have a bunch of pages for each band but in reality they are all getting mapped back to the same asp.net page but have different parameters.
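The mapping itself is simple. Here is a small Python sketch of the idea, not ASP.NET's actual routing or rewriting API: a pattern recognises the friendly URL and produces the internal one. The function name rewrite is made up for this illustration.

```python
import re
from typing import Optional
from urllib.parse import quote, unquote

# Matches /profiles/<name> with an optional trailing slash.
PROFILE_ROUTE = re.compile(r"^/profiles/(?P<name>[^/?#]+)/?$")


def rewrite(path: str) -> Optional[str]:
    """Maps a friendly profile path to the internal Profile.aspx URL.

    Returns None when the path is not a profile URL, so the caller can
    let the request through unchanged.
    """
    match = PROFILE_ROUTE.match(path)
    if match is None:
        return None
    name = unquote(match.group("name"))  # decode e.g. %20 back to a space
    return f"/Profile.aspx?Name={quote(name, safe='')}"
```

With this, rewrite("/profiles/Band1") gives "/Profile.aspx?Name=Band1", while an unrelated path such as "/about" gives None. In ASP.NET you would express the same rule through the rewriting or routing configuration rather than hand-written code.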
This is an article I read about it some time back: http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
i also hear that google cannot see dynamically generated content when crawling a site.
You could create a sitemap.xml with the URLs pointing to the dynamic profile pages. Using Google Webmaster Tools you can submit it and monitor the crawling progress.
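As a rough illustration of generating such a sitemap from the profile URLs, here is a Python sketch using the standard library. It emits only the required loc element per URL; the example URLs in the usage note are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Returns sitemap.xml content (UTF-8 bytes) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # special characters are escaped on output
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

In practice the list of URLs would come from the same database query that lists the bands, for example build_sitemap(["http://example.com/profiles/Band1", "http://example.com/profiles/Band2"]), and the result would be served at /sitemap.xml.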
You may also create an index page or something similar ('browse by category' pages) that links to the matching profile pages.
A reference for SEO I regularly use is http://www.seomoz.org/learn-seo
Is there a way in C# to get the output of AJAX or Java? What I'm trying to do is grab the specifics of items on a webpage, however the webpage does not load it into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by javascript through ajax calls, and this modified data is what you are trying to capture then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script, otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
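For the easy end of that range, the work is mostly building the URL and decoding the response. Here is a hedged Python sketch. The query parameters and the JSON field names (items, title, price) are invented for illustration; the real ones have to be read off the site's actual requests and responses.

```python
import json
from urllib.parse import urlencode


def listing_url(base, **query):
    """Builds the data URL the page's script would call, e.g. ...?page=2."""
    return f"{base}?{urlencode(query)}"


def parse_listings(payload):
    """Turns a JSON response into (title, price) pairs.

    Assumes a hypothetical shape: {"items": [{"title": ..., "price": ...}]}.
    """
    document = json.loads(payload)
    return [(item["title"], item["price"]) for item in document["items"]]
```

Because the response is already structured data, there is no HTML to pick apart: parse_listings hands back the records directly once the payload has been downloaded.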
A long time ago I wrote a VB app to screen scrape financial sites and made it so that you could fire up multiple of these "harvester" screen scrapers. That might ease the time period loading data. We could do thousands of scrapes a day with multiple of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, like what customer to get next and what was needed to scrape (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentComplete event. That should only fire when the web page is completely loaded. Then check out this post which gives an overview of how to do it.
Use the Html Agility Pack. It allows download of .html and scraping via XPath.
See How to use HTML Agility pack