Extracting Disqus comments from webpages using C#

I need to fetch the user comments from each news page on the CNN website, which uses the Disqus comment system. I am using C# for HTML parsing. Is there any specific code I can use to extract the comment author and comment text using C#?
Thanks in advance,
Dinusha

Since the Disqus embed is a JavaScript embed, the comments won't be available in the page source unless the site renders them there. If you're scraping the page and letting the JavaScript render, then the first page of comments (up to 50) is available within the Disqus iframe, inside the "postCompatContainer" DIV.
However, I'd instead suggest using the Disqus API to accomplish this. There are two main parts to this:
1) Get the thread information from the article. Specifically, in the page source you have to find the variables 'disqus_shortname' and 'disqus_identifier' or 'disqus_url'. If 'disqus_identifier' or 'disqus_url' aren't available, you can try using the window location address, but this is less reliable.
2) Make the API call with that data. Specifically, you'll want to use our threads/listPosts endpoint, passing 'disqus_shortname' as the 'forum' parameter and the identifier or URL as 'thread=ident:' or 'thread=link:', respectively.
I won't go into the specifics of using the API here, but we have a good starter tutorial here: http://help.disqus.com/customer/portal/articles/1131783-tutorial-get-comment-counts-with-the-api
and more examples here: https://github.com/disqus/DISQUS-API-Recipes
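That said, a minimal C# sketch of the threads/listPosts call might look like this; the shortname, identifier, and API key below are placeholders for the values described above:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class DisqusExample
{
    static async Task Main()
    {
        string shortname = "cnn";             // disqus_shortname from the page source
        string identifier = "article-12345";  // disqus_identifier from the page source
        string apiKey = "YOUR_API_KEY";       // public key from a registered Disqus app

        string url = "https://disqus.com/api/3.0/threads/listPosts.json" +
                     $"?forum={Uri.EscapeDataString(shortname)}" +
                     $"&thread=ident:{Uri.EscapeDataString(identifier)}" +
                     $"&api_key={Uri.EscapeDataString(apiKey)}";

        using (var client = new HttpClient())
        {
            // The JSON response contains a "response" array whose entries
            // carry "author.name" and "message" fields you can then parse.
            string json = await client.GetStringAsync(url);
            Console.WriteLine(json);
        }
    }
}
```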

Related

How to display a working hyperlink in MVC .NET

I currently have a .NET MVC web project that displays comments that are submitted by users and stored in a database.
I was wondering if anyone had any information on how to embed a link in a comment so that it can be clicked and followed to the desired link :)
I am aware that I can Google this, however with the wording of this question the answers Google finds have been quite basic and not very accurate.
Well, you can store the comments as HTML in the database and then display them using Html.Raw, but that opens up many security flaws. The better option is to set up a custom tag to represent a link, such as [url link="example.com"]click here[/url], and then parse that with a Razor helper and transform it into an HTML link.
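To illustrate, here's a minimal sketch of that parsing step; the CommentFormatter class and the exact tag format are illustrative, not from any library. Everything is HTML-encoded first so users can't inject markup:

```csharp
using System.Text.RegularExpressions;
using System.Web;

public static class CommentFormatter
{
    // Converts [url link="..."]text[/url] markers into anchors while
    // HTML-encoding everything else. Simplified sketch: assumes the link
    // target contains no & or " characters.
    public static string ToSafeHtml(string comment)
    {
        // Encode first so user-supplied markup can't inject script.
        // HtmlEncode leaves the square brackets alone but turns " into &quot;.
        string encoded = HttpUtility.HtmlEncode(comment);

        // Rewrite the (now-encoded) custom tags into real anchors.
        return Regex.Replace(
            encoded,
            @"\[url link=&quot;(?<href>[^&]+)&quot;\](?<text>.*?)\[/url\]",
            m => "<a href=\"" + m.Groups["href"].Value + "\">"
                 + m.Groups["text"].Value + "</a>",
            RegexOptions.IgnoreCase);
    }
}
```

In the view, you would then render the result with @Html.Raw(CommentFormatter.ToSafeHtml(comment)).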

Making Dynamically Created ASP.NET Pages SEO Friendly

I'm starting the pseudo code of a new site and want it to be as SEO friendly as possible.
The site I am creating is a booking agency site built with C# and ASP.NET. Essentially, bands will register on the site with their availability and other info, and fill out their profile information with images etc. This info will be stored in a DB.
Creating this is not a problem, but I want the site to be as SEO friendly as possible.
I know Google loves huge sites with great content, and all of these profile pages would be an excellent addition to my site for SEO purposes. I also hear that Google cannot see dynamically generated content when crawling a site.
I want to find a method of coding these pages so Google can see the content when it crawls them.
I need a pointer in the right direction for a solution. Nothing is off limits; I will basically code my entire site around this principle, I just have no idea where to start looking. I'm not looking for a code solution, just what I should be researching to solve this issue.
Thanks in advance
"I also hear that Google cannot see dynamically generated content when crawling a site."
Google can see anything you can retrieve via an HTTP GET request (i.e., anything that has its own URL) and that someone has either linked to or that is listed in a published XML sitemap file.
To make sure your profile pages fit this, render all profiles via a single ASP.NET *.aspx file that determines which profile is shown via a URL parameter. Something that looks like this:
http://example.com/profiles.aspx?profile=SomeBandName
Now, you probably also want a friendly URL that looks like this:
http://example.com/profiles/SomeBandName
To do that, you need to set up routing.
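With .NET 4's built-in Web Forms routing, a minimal sketch looks like this; the route name is illustrative, and the pattern and page match the URLs above:

```csharp
// Global.asax.cs: map the friendly /profiles/{profileName} URL onto the
// single profiles.aspx page using System.Web.Routing (.NET 4+).
using System;
using System.Web.Routing;

public class Global : System.Web.HttpApplication
{
    void Application_Start(object sender, EventArgs e)
    {
        RouteTable.Routes.MapPageRoute(
            "BandProfile",              // route name
            "profiles/{profileName}",   // friendly URL pattern
            "~/profiles.aspx");         // physical page that renders it
    }
}
```

Inside profiles.aspx, Page.RouteData.Values["profileName"] then tells you which band profile to render.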
To have your pages crawled and indexed properly by Google or other search engines, follow these guidelines:
i: The page title must be precise and match the content available on the page.
ii: Page URLs should be user friendly.
iii: Content is king (useful content).
iv: Don't rely on AJAX or JavaScript as the only way to load content.
v: Avoid Flash and other media files; where they exist, they must have a description via an alt tag.
vi: Create a URL sitemap of all static and dynamically generated content.
vii: Submit the sitemap to Google and keep tracking how Google crawls and indexes your pages; continuously fix any issues Google finds via crawling.
This way, most of your pages and content will be indexed properly and quickly.
I'd look into dynamic URL Rewriting.
Basically, instead of having one page, say http://localhost/Profile.aspx, you'll have a bunch of simulated URLs like
http://localhost/profiles/Band1
http://localhost/profiles/Band2
http://localhost/profiles/Band3
etc.
All of those will then map back to the original Profile.aspx page with a parameter, so internally in your code it would look like http://localhost/Profile.aspx?Name=Band1, http://localhost/Profile.aspx?Name=Band2, etc.
Basically, your website appears to have a separate page for each band, but in reality they are all mapped back to the same ASP.NET page with different parameters.
Here's an article I read about this some time back: http://weblogs.asp.net/scottgu/archive/2007/02/26/tip-trick-url-rewriting-with-asp-net.aspx
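A minimal sketch of that rewriting idea as an IHttpModule follows; the /profiles/ prefix and Profile.aspx name match the examples above, and registering the module in web.config is omitted:

```csharp
using System;
using System.Web;

public class ProfileRewriteModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
        {
            HttpContext ctx = ((HttpApplication)sender).Context;
            string path = ctx.Request.Path; // e.g. /profiles/Band1

            const string prefix = "/profiles/";
            if (path.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            {
                string name = path.Substring(prefix.Length);
                // The browser still sees /profiles/Band1; internally the
                // request is handled by Profile.aspx?Name=Band1.
                ctx.RewritePath("~/Profile.aspx?Name=" + HttpUtility.UrlEncode(name));
            }
        };
    }

    public void Dispose() { }
}
```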
"I also hear that Google cannot see dynamically generated content when crawling a site."
You could create a sitemap.xml with the URLs pointing to the dynamic profile pages. Using Google Webmaster Tools you can then submit it and monitor the crawling progress.
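A minimal sketch of generating that file with XmlWriter; the band-name list, domain, and output path are placeholders:

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class SitemapWriter
{
    // Writes a sitemap.xml entry for each dynamic profile page.
    public static void Write(string filePath, IEnumerable<string> bandNames)
    {
        const string ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        var settings = new XmlWriterSettings { Indent = true };

        using (XmlWriter writer = XmlWriter.Create(filePath, settings))
        {
            writer.WriteStartElement("urlset", ns);
            foreach (string name in bandNames)
            {
                writer.WriteStartElement("url", ns);
                writer.WriteElementString("loc", ns,
                    "http://example.com/profiles/" + Uri.EscapeDataString(name));
                writer.WriteEndElement(); // </url>
            }
            writer.WriteEndElement(); // </urlset>
        }
    }
}
```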
You could also create an index page or something similar ('browse by category' pages) that links to the matching profile pages.
A reference for SEO that I regularly use is http://www.seomoz.org/learn-seo

How to programmatically scrape the Fan Box iframe using C# or JS/jQuery and obtain specific data?

I have a new design requirement that uses the Facebook fan box in an unorthodox way. Instead of 'hacking' the fan box, I'd rather scrape the (hidden) fan box iframe and retrieve the contents of the grid_item class within the .fan_box class (i.e. .fan_box .connections_grid .grid_item). I basically need the URLs of the face images and their links.
I'd prefer a .NET methodology or JS/jQuery, and something to get me started and pointed in the right direction. Please don't just provide a basic method for pulling in a webpage to scrape; that's not the point. The iframe, and accessing the data within it, is the challenge here.
I've not seen anyone try this and have searched thoroughly. I am not an expert, so please give me more direction than you would give a brain-dead monkey. Thanks.
Why not just access the page's /feed connection via the API and display it in your own style?
Start here: https://developers.facebook.com/docs/reference/api/page/ (bear in mind you'll need an access token to access the feed)
A sample of the return type is here:
https://developers.facebook.com/tools/explorer/?method=GET&path=19292868552%2Ffeed
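For example, a minimal C# sketch of that call, where PAGE_ID and ACCESS_TOKEN are placeholders:

```csharp
using System;
using System.Net;

class PageFeedExample
{
    static void Main()
    {
        // Graph API call suggested above; substitute the real page id
        // and a valid access token.
        string url = "https://graph.facebook.com/PAGE_ID/feed" +
                     "?access_token=ACCESS_TOKEN";

        using (var client = new WebClient())
        {
            string json = client.DownloadString(url);
            // Parse with your JSON library of choice (e.g. Json.NET).
            Console.WriteLine(json);
        }
    }
}
```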

C# AJAX or Java response HTML scraping

Is there a way in C# to get the output of AJAX or JavaScript? What I'm trying to do is grab the details of items on a webpage; however, the webpage does not load them into the original source. Does anybody have a good tutorial or a good place to start?
For example, I would want to get all the car listings from http://www.madisonhonda.com/Preowned-Inventory.aspx#layout=layout1
If the DOM is being modified by JavaScript through AJAX calls, and this modified data is what you are trying to capture, then using a standard .NET WebClient won't work. You need to use a WebBrowser control so that it will actually execute the script; otherwise you will just be downloading the source.
If you need to just "load" it, then you'll need to understand how the page functions and try making the AJAX call yourself. Firebug and other similar tools allow you to see what requests are made by the browser.
There is no reason you cannot make the same web request from C# that the original page is making from Javascript. Depending on the architecture of the website, this could range in difficulty from constructing the proper URL with query string arguments (easy) to simulating a post with lots of page state (hard). The response content would most likely then be XML or JSON content instead of the HTML DOM, which if you're scraping for data will be a plus.
A long time ago I wrote a VB app to screen scrape financial sites, and made it so that you could fire up multiple of these "harvester" screen scrapers at once. That helped cut down the time spent loading data. We could do thousands of scrapes a day with several of these running on multiple boxes. Each harvester got its marching orders from information stored in the database, like which customer to process next and what needed to be scraped (balances, transaction history, etc.).
Like Michael said above, make a simple WinForms app with a WebBrowser control in it. You have to trap the DocumentCompleted event, which should only fire when the web page is completely loaded. Then check out this post, which gives an overview of how to do it.
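A minimal sketch of that approach, using the URL from the question; note that content loaded by further AJAX calls after the initial load may still require waiting or polling:

```csharp
using System;
using System.Windows.Forms;

class ScraperForm : Form
{
    private readonly WebBrowser browser = new WebBrowser();

    public ScraperForm()
    {
        browser.Dock = DockStyle.Fill;
        browser.ScriptErrorsSuppressed = true;
        browser.DocumentCompleted += OnDocumentCompleted;
        Controls.Add(browser);
        browser.Navigate("http://www.madisonhonda.com/Preowned-Inventory.aspx");
    }

    private void OnDocumentCompleted(object sender,
        WebBrowserDocumentCompletedEventArgs e)
    {
        // DocumentCompleted also fires for iframes; only react to the
        // top-level document.
        if (e.Url != browser.Url) return;

        // browser.Document now reflects the script-rendered DOM, so
        // DocumentText contains the rendered HTML rather than the raw source.
        Console.WriteLine("{0} characters of rendered HTML",
            browser.DocumentText.Length);
    }

    [STAThread]
    static void Main()
    {
        Application.Run(new ScraperForm());
    }
}
```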
Use the Html Agility Pack. It allows downloading HTML and scraping via XPath.
See How to use HTML Agility pack
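A short sketch of typical Html Agility Pack usage, with an illustrative URL and XPath expression:

```csharp
using System;
using HtmlAgilityPack;

class HapExample
{
    static void Main()
    {
        // Download and parse a page in one step.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/");

        // SelectNodes returns null when nothing matches, so guard for that.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return;

        foreach (HtmlNode link in links)
            Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}
```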

How can I make an application in c# collect data from a website?

First of all, I hope my question doesn't bother you. I really need to get an idea of how I can accomplish this, but unfortunately I'm really a beginner; I'm crawling when it comes to programming. I'm struggling to learn it the best way I can. I'll thank you for any help you give me.
Here's the task: I was ordered to find a way to collect some data from a website using a C# application. This will be done every day, in order to update the data we'll use to calculate a financial index.
I know my question might sound vague; even telling me how I can be more precise would help. I know I seem desperate, but putting aside all the personal issues, my scholarship kind of depends on it.
Thanks in advance! (Please don't mind the bad English, I'm Brazilian and my English might not be that good yet.)
First, your English is fine. In fact, I thought you were a native speaker until you said otherwise.
The term you're looking for is 'site scraping'. See this question: Options for HTML scraping?. The second answer points to the Html Agility Pack library, which you can use.
Now, there are two possibilities here. The first is you have to parse the HTML and scrape your data out of it. This is more computationally intensive and depends on the layout of the page. If they change the way the site looks, it could break the scraper.
The second possibility is they provide some XML or JSON web service you can consume. In this case you aren't scraping anything, but are rather using a true data feed. If the layout of the site changes, you will not break. Whether your target site supports this form of data feed is up to the site.
If I understand your question, you're being asked to do some Web Scraping, where you 1) download the contents of a web page and 2) try to parse data from that content.
For step #1, you should look into using a WebClient object in C# to download the HTML from the web page. You can give a WebClient object the URL you want to download the content from and obtain a String containing the content (probably HTML) of the URL.
How you go about doing step #2 depends on what content is present at the web site. If you know of certain patterns you're looking for in the HTML, you can search the HTML string using various methods. A more general solution for parsing HTML data can be found through using the Html Agility Pack, which will let you handle the HTML as a tree structure (DOM).
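A minimal sketch of step #1, with a placeholder URL:

```csharp
using System;
using System.Net;

class DownloadExample
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Step #1: download the page content as a string.
            string html = client.DownloadString("http://example.com/prices");

            // Step #2 would search or parse this string (see the Html
            // Agility Pack suggestion above); as a trivial check, just
            // confirm something came back.
            Console.WriteLine("Downloaded {0} characters", html.Length);
        }
    }
}
```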
1) Use the WebClient class to get the page.
2) Turn the HTML into XML.
3) Use XPath to select the data you are interested in.
OK, this is a pretty straightforward app design, and a lot of the code you need already exists for reuse. Since you're a beginner, I'll break down what you need to do into steps and recommend approaches.
1) You will use classes from System.Net to pull the web pages (WebClient being the easiest to use). You will want to have this part of the program run on a timer if you can (using the scheduled jobs feature of the OS) and have it just pull the pages and drop them in a folder; see the sketch after this list.
2) You have a second job which will run separately, pulling unread files from that folder, parsing them (using the Html Agility Pack library is best) and then storing them in an index of some kind (Lucene is best for that).
3) You have a front-end application of some sort (web or desktop) which queries that index for the information you're looking for.
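Here's a minimal sketch of step 1; the URL list and drop folder are placeholders, and the OS scheduler, not this code, would handle the daily timing:

```csharp
using System;
using System.IO;
using System.Net;

class FetchJob
{
    static void Main()
    {
        // Pages to pull each run; in practice these might come from a
        // config file or database.
        string[] urls = { "http://example.com/a", "http://example.com/b" };
        string dropFolder = @"C:\scrape\incoming";
        Directory.CreateDirectory(dropFolder);

        using (var client = new WebClient())
        {
            foreach (string url in urls)
            {
                // Download each page and drop it in the folder for the
                // separate parsing job to pick up.
                string html = client.DownloadString(url);
                string file = Path.Combine(dropFolder,
                    DateTime.UtcNow.Ticks + ".html");
                File.WriteAllText(file, html);
            }
        }
    }
}
```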
