Determining what is content in an HTML page - C#

I am building a news reader, and I have an option for users to share an article from a blog, website, etc. by entering a link to the page. For now I use two methods to determine the content of the page:
1. I try to extract the RSS feed link from the page the user entered and then match that URL against the feed items to find the right one.
2. If the site doesn't contain a feed, or the feed is malformed, or the entered address differs from the item link in the RSS (which is the case about 50% of the time, if not more), I fall back to the og meta tags. That works great, but only bigger sites have them; smaller sites and blogs usually have the same meta description for the whole website.
I am wondering how, for example, Google does it. When a website doesn't contain a meta description, Google somehow determines by itself what the content of the page is for its search results.
I am using HtmlAgilityPack to extract content from pages and my own methods to reduce the HTML to clean text.
Can someone explain the logic or the best approach to this? If I try to crawl the page directly from the top, I usually end up with content from the sidebar, navigation, etc.
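For reference, the og meta tag lookup described above is roughly the following. This is only a minimal sketch with HtmlAgilityPack; the URL and the helper name are placeholders rather than my actual code:

```csharp
// Minimal sketch: read Open Graph meta tags with HtmlAgilityPack.
// The URL and the GetMeta helper are illustrative placeholders.
using System;
using HtmlAgilityPack;

class OgExtractor
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example.com/some-article");

        Console.WriteLine(GetMeta(doc, "og:title"));
        Console.WriteLine(GetMeta(doc, "og:description"));
        Console.WriteLine(GetMeta(doc, "og:image"));
    }

    // Returns the content of <meta property="og:..." content="..."> or null if missing.
    static string GetMeta(HtmlDocument doc, string property)
    {
        var node = doc.DocumentNode.SelectSingleNode($"//meta[@property='{property}']");
        return node == null ? null : node.GetAttributeValue("content", "");
    }
}
```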

I ended up using Boilerpipe, which is written in Java; I imported it using IKVM and it works well for pages that are formatted correctly, but it still has trouble with some pages where the content is scattered.
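For anyone who finds this later, calling Boilerpipe from C# after the IKVM conversion looks roughly like the sketch below. It assumes the converted assembly keeps Boilerpipe's Java package names, and the page URL is a placeholder:

```csharp
// Sketch: run Boilerpipe's ArticleExtractor over a downloaded page via an
// IKVM-converted assembly. Assumes the DLL exposes the original Java packages.
using System;
using System.Net;
using de.l3s.boilerpipe.extractors;

class BoilerpipeDemo
{
    static void Main()
    {
        string html;
        using (var client = new WebClient())
        {
            // Placeholder URL; use the address the user entered.
            html = client.DownloadString("https://example.com/some-article");
        }

        // ArticleExtractor is Boilerpipe's heuristic for main-article content.
        string text = ArticleExtractor.INSTANCE.getText(html);
        Console.WriteLine(text);
    }
}
```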

Related

How to show an appropriate page on mobile

I want to capture blog posts from some blog sites. I know how to use HttpClient to get the HTML string, and then use Html Agility Pack to capture the content under a specific HTML tag. But if you use a WebView to show this HTML string, you will find that it doesn't look good on mobile. For example, CSS styles will not be loaded correctly, some code blocks will not auto-wrap, and some pictures will not show (an x is displayed instead).
Some advertisements will also show, which I don't want.
Does anyone know how to do this? Any suggestions will be appreciated.
Try running the HTML string through something like Google Mobilizer. This should produce a more mobile-friendly HTML string, which you can then 'unpack' with the Agility Pack.
Ideally you should capture the HTML page and all its associated resources: CSS files, images, scripts, and so on.
Then update the HTML content so that the resources are retrieved from your local data storage (for example, relative URLs will not work anymore if you save the HTML page locally).
You may also send your HTTP request with a User-Agent header that matches the one used by Microsoft's mobile browser, in order to obtain the corresponding version of the website (if they do some kind of User-Agent sniffing).
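A rough sketch of both suggestions combined, assuming HttpClient and HtmlAgilityPack; the User-Agent string and the ad selector below are only examples, not a definitive list:

```csharp
// Sketch: fetch the page with a mobile User-Agent, make image URLs absolute,
// and drop elements that look like advertisements. Selectors are examples only.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class MobileFetcher
{
    static async Task<string> GetCleanedHtmlAsync(string url)
    {
        using (var http = new HttpClient())
        {
            // Example mobile User-Agent; adjust to the browser you want to mimic.
            http.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent",
                "Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0)");
            string html = await http.GetStringAsync(url);

            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            var baseUri = new Uri(url);

            // Rewrite relative image URLs so they still resolve in a WebView.
            var images = doc.DocumentNode.SelectNodes("//img[@src]");
            if (images != null)
            {
                foreach (var img in images)
                {
                    var src = img.GetAttributeValue("src", "");
                    img.SetAttributeValue("src", new Uri(baseUri, src).AbsoluteUri);
                }
            }

            // Remove elements that look like ads (example XPath only).
            var ads = doc.DocumentNode.SelectNodes("//*[contains(@class,'ad-') or contains(@id,'advert')]");
            if (ads != null)
            {
                foreach (var ad in ads)
                {
                    ad.Remove();
                }
            }

            return doc.DocumentNode.OuterHtml;
        }
    }
}
```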

Getting the actual content from an RSS feed

Here, for example, is a link for ABC news which provides various RSS feeds to consume:
http://rss.cnn.com/rss/edition.rss
Using these feeds in a Windows 8 Store app, I am able to read them with the built-in SyndicationClient class. However, it gives only the title and a bit of summary text for the news story/article, not the full content. I want the full content, i.e. text and images. I have seen many news reader apps in the Windows Store that do this quite easily: when I tap on any story, the actual content appears right there.
Any idea how to accomplish this? Do I need some sort of HTML parser here?
You can have a look at the News and News Bento apps, for example. I want to achieve something similar.
Here are screenshots from the app. The first shows the text and images extracted from the news article; the second shows the view when you click on "View Original Article". I know that the second view uses a WebView control, but I want to know how to extract the data as in the first image.
Well, the answer is Readability. More here as well:
https://github.com/scottksmith95/CSharp.Readability
It took me a lot of time to find this out, but it is exactly what I wanted.
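The overall flow ends up looking roughly like the sketch below: SyndicationClient gives you the item's link, and you then download that page and hand the raw HTML to a readability-style extractor. The extractor call itself is left as a comment, since CSharp.Readability's API is not shown here:

```csharp
// Sketch: read the feed, follow each item's link, and fetch the full page
// for content extraction. The feed URL is the one from the question.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Windows.Web.Syndication;

class FeedReader
{
    public async Task LoadArticlesAsync()
    {
        var client = new SyndicationClient();
        SyndicationFeed feed = await client.RetrieveFeedAsync(
            new Uri("http://rss.cnn.com/rss/edition.rss"));

        using (var http = new HttpClient())
        {
            foreach (SyndicationItem item in feed.Items)
            {
                string title = item.Title.Text;     // what the feed already gives you
                Uri articleUri = item.Links[0].Uri; // link to the full story

                // The feed only carries a short summary, so download the full page...
                string html = await http.GetStringAsync(articleUri);

                // ...and run it through a readability-style extractor
                // (e.g. CSharp.Readability) to get the article body and images.
            }
        }
    }
}
```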

How to make a link live in a PDF with XFINIUM

I am trying to build a PDF tool with the XFINIUM library, and I would like to know if it is possible to retrieve the links in a PDF so that they are live (clickable) when displayed in my app. For now I can only see them as text and it is not possible to click on them, so they are not useful. I have looked through the XFINIUM samples but I couldn't find any hint as to what I should change to make them work.
Any help would be great.
Thanks a lot.
Links in a PDF file are stored as link annotations. You can retrieve these links as follows: load your file into a PdfFixedDocument; the document's Pages collection is populated automatically with all the pages in the document.
Each page has an Annotations collection, which is populated automatically with all the annotations on the page. Loop through this collection and test which annotations are link annotations. The position of the link on the page is given by the VisualRectangle property.
If you need the link's URL, you have to inspect the Action property of the link annotation. If it is a URI action, then the URI property of the action will give you the link's URL.
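A minimal sketch of those steps is below. The property names come from the description above, but the exact type names (PdfLinkAnnotation, PdfUriAction) and namespaces are assumptions to verify against the XFINIUM.PDF documentation:

```csharp
// Sketch: enumerate link annotations and print the URL of each URI action.
// Type names such as PdfLinkAnnotation and PdfUriAction are assumed here.
using System;
using System.IO;
using Xfinium.Pdf;
using Xfinium.Pdf.Annotations;
using Xfinium.Pdf.Actions;

class LinkLister
{
    static void ListLinks(string path)
    {
        using (var stream = File.OpenRead(path))
        {
            var document = new PdfFixedDocument(stream);

            foreach (PdfPage page in document.Pages)
            {
                foreach (PdfAnnotation annotation in page.Annotations)
                {
                    // Only link annotations represent clickable areas.
                    var link = annotation as PdfLinkAnnotation;
                    if (link == null) continue;

                    var area = link.VisualRectangle; // position of the link on the page

                    // A URI action means the link points to an external URL.
                    var uriAction = link.Action as PdfUriAction;
                    if (uriAction != null)
                    {
                        Console.WriteLine($"Link area {area}: {uriAction.URI}");
                    }
                }
            }
        }
    }
}
```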
Disclaimer: I work for the company that develops XFINIUM.PDF library.

Get the current page number of a PDF document in ASP.NET

I am trying to implement a feature where I open a PDF file (multiple pages), say in an iframe, highlight a section of the document, and get the page number (the one that is displayed in the PDF viewer's toolbar).
E.g. if the toolbar displays 2/7, which means I am currently on page 2, I need to capture that page number. Sounds simple, but I have not been able to find a .dll/function that exposes this property.
Any help would be appreciated. Thanks.
I wouldn't think this is possible: there's no way to control PDFs with JavaScript in the browser, which is what you'd need to do.
This article suggests the same: http://codingforums.com/showthread.php?t=43436.
Content of link:
In short, no, you can't do that.
I really don't think JS can read properties of PDFs, since PDFs are viewed in the browser through a plugin, i.e. a viewport for another application (for want of a better explanation).
You may be better off trying a different route, such as generating the pages as images and implementing your own paging. It depends on your content and requirements, of course. ABCpdf from http://www.websupergoo.com/ is free (with a link-back); not sure if that's any help for you.
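If you do go the images route, a rough sketch with ABCpdf might look like the following; the versioned namespace and the exact calls are assumptions to check against ABCpdf's documentation rather than working code:

```csharp
// Sketch: render each PDF page to a PNG so you can implement your own paging.
// The namespace is versioned (e.g. WebSupergoo.ABCpdf10); adjust to your release.
using WebSupergoo.ABCpdf10;

class PdfToImages
{
    static void RenderPages(string pdfPath, string outputFolder)
    {
        using (var doc = new Doc())
        {
            doc.Read(pdfPath);

            for (int i = 1; i <= doc.PageCount; i++)
            {
                doc.PageNumber = i;                   // select the page to render
                doc.Rect.String = doc.CropBox.String; // render the full page area
                doc.Rendering.Save(System.IO.Path.Combine(outputFolder, "page-" + i + ".png"));
            }
        }
    }
}
```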

Google image search

In C#, how can I extract the URLs of any images found when performing a search with Google? I'm writing a little app to fetch the artwork for my ripped CDs. I played around with the Amazon service but found the results I received were erratic, and I can't be bothered to learn the whole Amazon API just for this simple little app, so I thought I'd try Google instead.
So far, I've performed the search and got the result page's source, but I'm not sure how to extract the URLs from it. I know I could use a regex, but I have no idea what expression to use; all the ones I've found seem to be broken. Any help would be appreciated.
Try using the HTML Agility Pack. It works wonders for scraping content.
It lives here on CodePlex.
I used it to scrape a user ranking list from so.com, and loved it.
It will let you select a node of HTML and then query its subnodes using XPath.
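For example, here is a minimal sketch that loads the result page and pulls every img src via an XPath query; the search URL is only an example, and Google's markup changes often, so expect to adjust the query:

```csharp
// Sketch: load a search result page with HtmlAgilityPack and list image URLs.
// The URL and the XPath query are examples only.
using System;
using HtmlAgilityPack;

class ImageUrlScraper
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://www.google.com/search?q=album+artwork&tbm=isch");

        var images = doc.DocumentNode.SelectNodes("//img[@src]");
        if (images == null) return;

        foreach (HtmlNode img in images)
        {
            Console.WriteLine(img.GetAttributeValue("src", ""));
        }
    }
}
```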
