How do I retrieve RSS Feeds based on a date range?
Specifically, how do I prepare the url so that I can get items that were published past a certain date?
www.pwop.com/feed.aspx?show=dotnetrocks&filetype=master&tags=Craftsmanship
Your question is really about the site's HTTP API, not RSS itself.
RSS is a predefined XML data format.
Most RSS endpoints don't support filters; they expose a simple URL that returns the last X results in RSS format (X is usually between 10 and 50).
Some URLs let you specify categories or tags, as in your example, so the returned RSS XML will contain only results with those tags.
If you don't want to miss results, you need to keep querying the RSS URL every X minutes or hours, depending on how frequently the feed updates.
Another option is to contact the site and request full API access, or even ask them to implement a filter-by-date feature.
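Since the feed itself can't be filtered server-side, the usual workaround is to fetch the whole feed and filter by `pubDate` on your end. Here is a minimal Python sketch of that idea (the question is C#-oriented, where `SyndicationFeed` offers the same fields); the feed XML is hard-coded in place of the HTTP response, and the titles and dates are made up for illustration:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime  # parses RFC 822 dates like RSS pubDate
import xml.etree.ElementTree as ET

# Stand-in for the XML you would download from the feed URL.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Sample</title>
  <item><title>Old episode</title><pubDate>Mon, 02 Jan 2017 08:00:00 GMT</pubDate></item>
  <item><title>New episode</title><pubDate>Fri, 01 Mar 2019 08:00:00 GMT</pubDate></item>
</channel></rss>"""

def items_after(feed_xml, cutoff):
    """Return (title, published) pairs for items published after `cutoff`."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        published = parsedate_to_datetime(item.findtext("pubDate"))
        if published > cutoff:
            results.append((item.findtext("title"), published))
    return results

cutoff = datetime(2018, 1, 1, tzinfo=timezone.utc)
print(items_after(SAMPLE_FEED, cutoff))
```

The obvious limitation is the one described above: the feed only ever contains the last X items, so client-side filtering can only narrow that window, never reach further back.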
Not all websites support it, but maybe there is a solution that can work:
Websites usually have a sitemap.xml (or sitemap.xml.gz or sitemap.gz) file that lists all of the site's URLs, either in bulk or grouped in some way (e.g., by category, tag, or month). The sitemap.xml can contain links to additional XML files, and so on.
The main sitemap is typically located in the root of the site (e.g., https://news.bitcoin.com/sitemap.xml), but you can find more information about sitemaps here: https://www.sitemaps.org/protocol.html.
If a website has such an XML file, processing it may make it easier to extract the needed information without a special site crawler or API.
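To illustrate the structure described above, here is a small Python sketch that reads a sitemap index (which points at child sitemaps) and a urlset (which lists page URLs with optional `lastmod` dates). The sample XML and example.com URLs are invented; real sitemaps follow the same schema from sitemaps.org:

```python
import xml.etree.ElementTree as ET

# Sitemap files live in the sitemaps.org namespace, so queries need a prefix map.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_INDEX = """<?xml version="1.0"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-2019-01.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-2019-02.xml</loc></sitemap>
</sitemapindex>"""

URLSET = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post-1</loc><lastmod>2019-02-03</lastmod></url>
  <url><loc>https://example.com/post-2</loc><lastmod>2019-02-14</lastmod></url>
</urlset>"""

def child_sitemaps(xml_text):
    """List the URLs of child sitemaps referenced by a sitemap index."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]

def urls_with_lastmod(xml_text):
    """List (url, lastmod) pairs from a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [(u.findtext("sm:loc", namespaces=NS), u.findtext("sm:lastmod", namespaces=NS))
            for u in root.findall("sm:url", NS)]

print(child_sitemaps(SITEMAP_INDEX))
print(urls_with_lastmod(URLSET))
```

The `lastmod` field is what makes this useful for the date-range problem: unlike an RSS feed, a sitemap can cover the whole site's history, and you can filter the URL list by date before fetching anything.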
I have an api that can pass a search query to a website that I use to lookup products. I use the catalog number to obtain the device identifier. The response that is returned is HTML, and I need to extract one line from the HTML to write to a file. Is it possible to select a specific div in a web api?
My goal is to eventually loop over each product search, pull the one line I need, and then write it to an excel file.
Here is an example of the API searching a product, and the response (screenshot: "api working").
Here is the single line that I need to extract from the response; I then want to concatenate it to the URL and write out the whole link with each specific device identifier (screenshot: "Line of code I need").
I hope this makes sense.
This is a parsing problem, and since the content you want to extract from is HTML, it is a fairly straightforward task.
You have three main steps to get this done.
1) Parse the content, whether it's on the web or in a downloaded file.
2) Use a selector to get the "a" tag you're looking for.
3) Extract the URL from the "href" attribute of the "a" tag.
I see you're using C#, so I would recommend this library: use its parser to parse the file, then its selector engine with a CSS selector to get your data.
Let me know if you still need more details.
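The three steps above can be sketched quickly in Python with the standard-library HTML parser (in C# a library with CSS selectors makes step 2 a one-liner, but the flow is identical). The HTML fragment, the `device-link` class, and the example.com base URL are all hypothetical stand-ins for whatever the real response contains:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags carrying a given CSS class."""
    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Step 2: select the <a> tag we want by its class attribute.
        if tag == "a" and self.css_class in attrs.get("class", "").split():
            # Step 3: extract the href attribute.
            self.links.append(attrs.get("href"))

# Hypothetical response fragment; the real markup and class names will differ.
HTML = '<div><a class="device-link" href="/device/ABC123">ABC 123</a></div>'

parser = LinkExtractor("device-link")
parser.feed(HTML)  # Step 1: parse the content.

# Concatenate each extracted path onto the base URL, as the question describes.
full_links = ["https://example.com" + href for href in parser.links]
print(full_links)
```

Looping this over each product search and writing `full_links` out row by row covers the Excel-file goal; the extraction itself never needs more than these three steps.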
I am wondering how RSS applications parse feeds. If I just want to parse the XML from a feed, I would use XmlReader:
http://content.warframe.com/dynamic/rss.php
With this feed I get an exception about an illegal path (that is less important), BUT if I put this link into another application (link at the bottom), it works...
The W3C feed validator shows many errors:
http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Fcontent.warframe.com%2Fdynamic%2Frss.php
If this RSS feed really has so many "errors", why does it work with other applications? For example, when I put it into:
http://feed.mikle.com/
I am building a news reader, and I have an option for users to share an article from a blog, website, etc. by entering a link to the page. I am using two methods for now to determine the content of a page:
I try to extract the RSS feed link from the page the user entered, and then match that URL against the feed to get the right item.
If the site doesn't contain a feed, or the feed is malformed, or the entered address differs from the item link in the RSS (which happens in about 50% of cases, if not more), I try to find og meta tags. That works great, but only bigger sites have them; smaller sites and blogs often use the same meta description for the whole website.
I am wondering how, for example, Google does it? When a website doesn't contain a meta description, Google somehow determines by itself what the content of the page is for its search results.
I am using HtmlAgilityPack to extract stuff from pages and my own methods to clean html to text.
Can someone explain the logic or best approach to this? If I try to crawl the content directly from the top, I usually end up with content from the sidebar, navigation, etc.
I ended up using Boilerpipe, which is written in Java; I imported it using IKVM, and it works well for pages that are formatted correctly, but it still has trouble with some pages where the content is scattered.
I am writing a very simple RSS reader - all it needs to do is get the xml doc, and print to the console the title and publish date of every item. I got started using these two questions:
How can I get started making a C# RSS Reader?
Reading the Stack Overflow RSS feed
I'm trying to figure out how to subscribe, and as far as I can tell you do it in one of two ways: send an HTTP request to the feed site so it pushes you updates as they come, or poll the site every X seconds and simply print the new items.
I find it difficult to believe that there is no way to subscribe: with the millions of RSS readers running at any given moment, popular sites like Facebook, Twitter, or MySpace would be hit hundreds of millions of times per second by all the readers "subscribed" to them, and it would look like a DoS attack.
So what is the "standard" way to subscribe to an RSS feed, if such a standard truly exists?
The standard way is to poll. Not every x seconds but every x minutes or x hours.
The reasoning behind RSS is to keep the feed extremely simple: a small download, and the same file can be served to all subscribers (easy to cache in memory, with no processing overhead to work out exactly what to send to each client and when).
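Polite polling usually also means conditional GETs: the client stores the `ETag` and `Last-Modified` headers from the previous fetch and sends them back, so an unchanged feed costs the server only a tiny 304 response. Here is a small Python sketch of the header-building part (the cache values are invented examples; the actual values come from the server's previous response):

```python
def conditional_headers(cache):
    """Build HTTP request headers for a conditional GET from the ETag and
    Last-Modified values saved from the previous poll.

    On the first request the cache is empty, so no extra headers are sent;
    afterwards, a 304 Not Modified reply means the feed is unchanged and
    there is nothing new to parse."""
    headers = {}
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    if cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]
    return headers

# Values remembered from the last successful fetch (made up for illustration).
cache = {"etag": '"abc123"', "last_modified": "Tue, 05 Mar 2019 10:00:00 GMT"}
print(conditional_headers(cache))
```

This is part of why plain polling scales as well as it does: most polls of a quiet feed never transfer the feed body at all.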
Not sure you quite understand the concept of RSS feeds.
It is simple:
Your application (the RSS reader) sends an HTTP GET request to the given RSS feed URL.
You get XML in return.
You parse that XML and show the data in your UI.
And generally, the websites you mentioned are smart enough to identify DoS attacks (for example, frequent requests from the same IP in a very short time), so you don't have to worry about that.
Also, while creating an RSS reader, every time you get new XML from the feed URL, you have to distinguish new posts from old ones (those you already show in your UI). Timestamps are generally used to identify posts, but there is no standard way of doing it.
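One common approach to the new-vs-old problem is to key items on the feed's `<guid>` element where present, falling back to `<link>`. A minimal Python sketch of that idea, with a hard-coded sample feed in place of the fetched XML:

```python
import xml.etree.ElementTree as ET

def new_items(feed_xml, seen_keys):
    """Return titles of items not seen in a previous poll.

    Each item is keyed on its <guid>, falling back to <link> when the
    publisher omits guids; `seen_keys` is mutated to remember the new ones."""
    fresh = []
    for item in ET.fromstring(feed_xml).iter("item"):
        key = item.findtext("guid") or item.findtext("link")
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(item.findtext("title"))
    return fresh

# Invented sample feed; a real poll would return the downloaded XML here.
FEED = """<rss version="2.0"><channel>
  <item><title>Post A</title><guid>a-1</guid></item>
  <item><title>Post B</title><guid>b-2</guid></item>
</channel></rss>"""

seen = {"a-1"}               # keys already shown in the UI
print(new_items(FEED, seen))  # only "Post B" is new
```

In practice `seen` would be persisted between polls (a file or small database), since the whole point is surviving the gap between two fetches.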
RSS on a site/server does not manage any subscriptions. The subscription is only a concept in the RSS reader. That keeps things simple on the server side: there is no need for subscription management, which made the protocol easy to adopt.
You have to periodically poll the RSS feed with an HTTP GET to the feed URL. You get an XML document in the RSS format in return. Then you parse it and display the information you like. Voila.
I have seen many websites displaying RSS feeds on their pages.
Example:
1) compgroups.net
2) velocityreviews.com
3) bytes.com
4) eggheadcafe.com
And many other websites.
What I observe is that Google even gives them good rankings despite the duplicate content.
What I want to know is...
How can I find RSS feeds? Also, where can I find RSS feeds for Yahoo Groups?
Read here: http://help.yahoo.com/l/ca/yahoo/groups/rss/rss-03.html
The source of an RSS feed is XML. You can request a feed from C# using HttpWebRequest, providing the URL of the feed.
You read in the XML, process it, and show its contents in your webpage.
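The read-process-show step can be sketched in a few lines. This Python version renders feed items as an HTML list for embedding in a page (the answer above is C#-oriented, but the transformation is the same); the feed XML and example.com links are invented stand-ins for the downloaded document:

```python
import xml.etree.ElementTree as ET

# Stand-in for the XML returned by the HTTP request to the feed URL.
FEED = """<rss version="2.0"><channel>
  <item><title>First post</title><link>https://example.com/1</link></item>
  <item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""

def feed_to_html(feed_xml):
    """Render each feed item as a linked list entry for embedding in a page."""
    rows = [
        '<li><a href="{}">{}</a></li>'.format(i.findtext("link"), i.findtext("title"))
        for i in ET.fromstring(feed_xml).iter("item")
    ]
    return "<ul>\n" + "\n".join(rows) + "\n</ul>"

print(feed_to_html(FEED))
```

Note that real feed titles can contain characters that need HTML-escaping before being written into a page; a production version should escape them rather than interpolate raw text.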