I need to create an HTML parser that, given a blog URL, returns a list of all the posts on the page. E.g. if a page has 10 posts, it should return a list of 10 divs, where each div contains an h1 and a p.
I can't use its RSS feed, because I need to know exactly how the page looks to the user, including any ads, images, etc. Besides, some blogs put only a summary of each post on the page while the feed has the full content, and vice versa.
Anyway, I've made a parser that downloads the feed and searches the HTML for similar content. It works very well for some blogs, but not for others.
I don't think I can make a parser that works for 100% of the blogs it parses, but I want to make the best one possible.
What would be the best approach? Looking for tags whose id attribute equals "post" or "content"? Looking for p tags? Etc.
Thanks in advance for any help!
I don't think you will be successful with that. You might be able to parse one blog, but if the blog engine changes anything, it won't work any more. I also don't think you'll be able to write a generic parser. You might even be partially successful, but any success will be fragile, because everything in this context is so error prone. If you need content, you should go with RSS. If you simply need to store how a page looks, you can also do that. But parsing pages by the way they look? I don't see that working reliably.
"Best possible" turns out to be "best reasonable," and you get to define what is reasonable. You can get a very large number of blogs by looking at how common blogging tools (WordPress, LiveJournal, etc.) generate their pages, and code specially for each one.
The general case turns out to be a very hard problem because every blogging tool has its own format. You might be able to infer things using "standard" identifiers like "post", "content", etc., but it's doubtful.
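If you do go the heuristic route, one common trick is to score candidate elements by how much text they contain relative to markup, with a bonus for "post"/"content"-style ids and classes. A rough sketch with the HTML Agility Pack (the cutoff and bonus values below are arbitrary guesses, not a proven recipe):

using System;
using System.Linq;
using HtmlAgilityPack;

public static class ContentHeuristics
{
    // Picks the div with the highest ratio of text to total markup,
    // preferring ids/classes that look like post containers.
    public static HtmlNode GuessMainContent(HtmlDocument doc)
    {
        var candidates = doc.DocumentNode
            .Descendants("div")
            .Where(d => d.InnerText.Trim().Length > 200); // arbitrary cutoff

        return candidates
            .OrderByDescending(Score)
            .FirstOrDefault();
    }

    static double Score(HtmlNode node)
    {
        double textRatio = (double)node.InnerText.Length
                           / Math.Max(1, node.InnerHtml.Length);
        string idClass = (node.GetAttributeValue("id", "") + " "
                        + node.GetAttributeValue("class", "")).ToLower();
        // Bonus for "post"/"content"-style keywords; the weight is a guess.
        double bonus = new[] { "post", "content", "entry", "article" }
            .Count(k => idClass.Contains(k)) * 0.25;
        return textRatio + bonus;
    }
}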
You'll also have difficulty with ads. A lot of ads are generated with JavaScript. So downloading the page will give you just the JavaScript code rather than the HTML that gets generated. If you really want to identify the ads, you'll have to identify the JavaScript code that generates them. Or, your program will have to execute the JavaScript to create the final DOM. And then you're faced with a problem similar to that above: figuring out if some particular bit of HTML is an ad.
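Short of running the scripts, you can sometimes flag likely ad slots just by where the script and iframe tags point. A crude sketch (the host list is a made-up, non-exhaustive assumption):

using System.Linq;
using HtmlAgilityPack;

static bool LooksLikeAd(HtmlNode node)
{
    // Hypothetical, non-exhaustive list of ad-serving hosts.
    string[] adHosts = { "doubleclick.net", "googlesyndication.com" };
    string src = node.GetAttributeValue("src", "");
    return (node.Name == "script" || node.Name == "iframe")
           && adHosts.Any(h => src.Contains(h));
}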
There are heuristic methods that are somewhat successful. Check out Identifying a Page's Primary Content for answers to a similar question.
Use the HTML Agility Pack. It is an HTML parser made for exactly this.
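For instance, fetching a page and grabbing every div whose id contains "post" is only a few lines (the URL and the id filter here are placeholders):

using System;
using HtmlAgilityPack;

var web = new HtmlWeb();
HtmlDocument doc = web.Load("http://example-blog.com"); // placeholder URL

// SelectNodes returns null (not an empty list) when nothing matches.
var postDivs = doc.DocumentNode.SelectNodes("//div[contains(@id, 'post')]");
if (postDivs != null)
{
    foreach (var div in postDivs)
        Console.WriteLine(div.GetAttributeValue("id", "(no id)"));
}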
I just did something like this for our company's blog, which uses WordPress. This works well for us because our WordPress blog hasn't changed in years, but the others are right: if the HTML changes a lot, parsing becomes a cumbersome solution.
Here is what I recommend:
Using NuGet, install RestSharp and HtmlAgilityPack. Then download Fizzler and include those references in your project (http://code.google.com/p/fizzler/downloads/list).
Here is some sample code I used to implement the blog's search on my site.
using System;
using System.Collections.Generic;
using Fizzler.Systems.HtmlAgilityPack;
using RestSharp;
using RestSharp.Contrib;

namespace BlogSearch
{
    public class BlogSearcher
    {
        const string Site = "http://yourblog.com";

        public static List<SearchResult> Get(string searchTerms, int count = 10)
        {
            var searchResults = new List<SearchResult>();
            var client = new RestSharp.RestClient(Site);
            // note: 10 is the page size for the search results
            var pages = (int)Math.Ceiling((double)count / 10);
            for (int page = 1; page <= pages; page++)
            {
                var request = new RestSharp.RestRequest
                {
                    Method = Method.GET,
                    // the part after .com/
                    Resource = "page/" + page
                };
                // your search params here
                request.AddParameter("s", HttpUtility.UrlEncode(searchTerms));
                var res = client.Execute(request);
                searchResults.AddRange(ParseHtml(res.Content));
            }
            return searchResults;
        }

        public static List<SearchResult> ParseHtml(string html)
        {
            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            // each search hit is a direct child div of #content-main
            var results = doc.DocumentNode.QuerySelectorAll("#content-main > div");
            var searchResults = new List<SearchResult>();
            foreach (var node in results)
            {
                bool add = false;
                var sr = new SearchResult();
                var a = node.QuerySelector(".posttitle > h2 > a");
                if (a != null)
                {
                    add = true;
                    sr.Title = a.InnerText;
                    sr.Link = a.Attributes["href"].Value;
                }
                var p = node.QuerySelector(".entry > p");
                if (p != null)
                {
                    add = true;
                    sr.Excerpt = p.InnerText;
                }
                if (add)
                    searchResults.Add(sr);
            }
            return searchResults;
        }
    }

    public class SearchResult
    {
        public string Title { get; set; }
        public string Link { get; set; }
        public string Excerpt { get; set; }
    }
}
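Calling it then looks like this (assuming the CSS selectors above match your blog's theme):

// Fetch up to 20 search results for "html parser" from the configured blog.
List<SearchResult> results = BlogSearcher.Get("html parser", 20);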
Good luck,
Eric
Hello SO folks, and more specifically the Google folks monitoring this tag per your support page. I am working from .NET, and PlaylistItems.List("snippet,contentDetails") does not do a whole lot compared to the old RSS feed search. In fact, adding the contentDetails part adds little value, in that only the VideoId is now returned, and it is already part of Snippet.ResourceId.VideoId:
"kind": "youtube#playlistItem",
bla,
bla,
"contentDetails": {
"videoId": "DLME0PsJRnk"
}
Why add a "part" which is only going to return one bit of information?
How about supporting something like "snippet,contentDetails(duration,PublishedAt,Views)"
I feel this is the kind of basic metadata (snippet) that most apps would want to list to their users.
While you are at it, please remove this nonsense of Java-style casing of parameters. Why would you leak your language of choice into an API? That's really sad. Yes, it is frustrating to keep checking whether I cased them correctly.
Well, it looks like you are forcing "us" to build a list of VideoIds and then turn around and make more API calls, when I was previously doing it with fewer.
It also means I will have to manage the 50-item-max paging twice: once for the playlist if it is over 50 videos, and then manually for my list of VideoIds when I turn around to make Videos.List calls.
Let me know if I missed an all-in-one type of API call. Thank you.
Here is what I have working now; let me know if there is a better way:
// 20150802
public async Task<List<YouTubeInfo>> PlaylistVideosInfo(String PlaylistID)
{
    var YoutubeService = YouTubeService();

    List<YouTubeInfo> VideoInfos = new List<YouTubeInfo>();

    var NextPageToken = "";
    while (NextPageToken != null)
    {
        var SearchListRequest = YoutubeService.PlaylistItems.List("snippet");
        SearchListRequest.PlaylistId = PlaylistID;
        SearchListRequest.MaxResults = 50;
        SearchListRequest.PageToken = NextPageToken;

        // Call the search.list method to retrieve results matching the specified query term.
        var SearchListResponse = await SearchListRequest.ExecuteAsync();

        // Collect Video IDs from this page
        var VideoIDsBatch = new List<string>(); // batch Video detail search by 50 max
        foreach (var searchResult in SearchListResponse.Items)
        {
            VideoIDsBatch.Add(searchResult.Snippet.ResourceId.VideoId);
        }

        // Make API call for this batch - expect a single page :(
        var VideoListRequest = YoutubeService.Videos.List("snippet,contentDetails");
        VideoListRequest.Id = String.Join(",", VideoIDsBatch);
        VideoListRequest.MaxResults = 50;
        var VideoListResponse = await VideoListRequest.ExecuteAsync();

        // Collect each video's details
        foreach (var VideoResult in VideoListResponse.Items)
        {
            YouTubeInfoAdd(VideoInfos, VideoResult);
        }

        // request next page
        NextPageToken = SearchListResponse.NextPageToken;
    }
    // Return all videos' details
    return VideoInfos;
}
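YouTubeInfo and YouTubeInfoAdd aren't shown above; a minimal sketch of what they might look like, assuming you only want the ID, title, and duration (the property names are my assumptions, not the poster's actual class):

using System.Collections.Generic;
using Google.Apis.YouTube.v3.Data;

public class YouTubeInfo
{
    public string VideoId { get; set; }
    public string Title { get; set; }
    public string Duration { get; set; } // ISO 8601, e.g. "PT4M13S"
}

public static void YouTubeInfoAdd(List<YouTubeInfo> list, Video video)
{
    list.Add(new YouTubeInfo
    {
        VideoId = video.Id,
        Title = video.Snippet.Title,
        Duration = video.ContentDetails.Duration
    });
}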
I decided to help my friend with a project he's working on. I'm trying to write a test webpage for him to verify some new functionality, but in my auto-generated code I get
CS1106: Extension method must be defined in a non-generic static class
Implementing the code in index.cshtml isn't the best way to do this, but we are just trying to do a proof of concept and will do a proper implementation later.
In all the places I looked, the answers pretty much said that all the functions I define must be in a static class (as the error states). That wouldn't be so bad, except that the class holding all my functions is auto-generated and not static. I'm not really sure what settings I can change to fix this.
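That is, the pattern those answers describe looks roughly like this (a generic illustration, not my actual code):

// Extension methods are only legal in a top-level, non-generic static class.
public static class MyExtensions
{
    public static bool IsEmpty(this string s)
    {
        return string.IsNullOrEmpty(s);
    }
}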
Here is a copy of the relevant (I believe) parts of the code. The implementation of some or all of the functions may be incorrect; I haven't tested them yet.
@{
    HttpRequest req = System.Web.HttpContext.Current.Request;
    HttpResponse resp = System.Web.HttpContext.Current.Response;
    var url = req.QueryString["url"];
    // 1. Download web data from URL
    // 2. Write the final edited version of the document to the response object using resp.Write(string x)
    // 3. Add script tag for dom-outline-1.0 to the HTML Agility Pack document
    // 4. Search for relative URLs and correct them to become absolute URLs that point back to the hostname
}
@functions
{
    public static void PrintNodes(this HtmlAgilityPack.HtmlNode tag)
    {
        HttpResponse resp = System.Web.HttpContext.Current.Response;
        resp.Write(tag.Name + tag.InnerHtml);
        if (!tag.HasChildNodes)
        {
            return;
        }
        PrintNodes(tag.FirstChild);
    }

    public static void AddScriptNode(this HtmlAgilityPack.HtmlNode headNode, HtmlAgilityPack.HtmlDocument htmlDoc, string filePath)
    {
        string content = "";
        using (StreamReader rdr = File.OpenText(filePath))
        {
            content = rdr.ReadToEnd();
        }
        if (headNode != null)
        {
            HtmlAgilityPack.HtmlNode scripts = htmlDoc.CreateElement("script");
            scripts.Attributes.Add("type", "text/javascript");
            scripts.AppendChild(htmlDoc.CreateComment("\n" + content + "\n"));
            headNode.AppendChild(scripts);
        }
    }
}
<HTML CODE HERE>
A smarter approach would be to encapsulate the design around delegates; the reason being that if you use a delegate, you don't have to worry about referencing something static.

public delegate void MyUrlThing(string url, object optional = null);

Possibly with some state:

public enum UrlState
{
    None,
    Good,
    Bad
}

Then the void return type would become UrlState.

If you wanted, you could also set up a text box and blindly give it CIL...

Then you would compile the delegates using something like this:

http://www.codeproject.com/Articles/578116/Complete-Managed-Media-Aggregation-Part-III-Quantu

That way you could optionally use the IL to augment whatever you wanted.

You could also give it C# code, I suppose...

If you want to keep your design, you can also optionally use interfaces, then put the compiled DLL in a directory and load it, etc., as is traditionally done.
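A minimal sketch of the delegate idea inside a Razor page (the handler body is a placeholder, not working logic):

@functions
{
    // Declaring the delegate type as a member of the page class is fine;
    // only extension methods trigger CS1106.
    public delegate void MyUrlThing(string url, object optional = null);
}
@{
    MyUrlThing handle = (url, optional) =>
    {
        // placeholder: fetch and process the URL here
        Response.Write("Handling " + url);
    };
    handle(Request.QueryString["url"]);
}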
So I am starting to learn how to use XML data in an app, and I decided to use some free data to do this. However, I cannot for the life of me get it working. This is my code so far. (I have done a few apps with static data before, but hey, apps are designed to use the web, right? :p)
public partial class MainPage : PhoneApplicationPage
{
    List<XmlItem> xmlItems = new List<XmlItem>();

    // Constructor
    public MainPage()
    {
        InitializeComponent();
        LoadXmlItems("http://hatrafficinfo.dft.gov.uk/feeds/datex/England/CurrentRoadworks/content.xml");
        test();
    }

    public void test()
    {
        foreach (XmlItem item in xmlItems)
        {
            testing.Text = item.Title;
        }
    }

    public void LoadXmlItems(string xmlUrl)
    {
        WebClient client = new WebClient();
        client.OpenReadCompleted += (sender, e) =>
        {
            if (e.Error != null)
                return;
            Stream str = e.Result;
            XDocument xdoc = XDocument.Load(str);
            ***xmlItems = (from item in xdoc.Descendants("situation id")
                           select new XmlItem()
                           {
                               Title = item.Element("impactOnTraffic").Value,
                               Description = item.Element("trafficRestrictionType").Value
                           }).ToList();***
            // close
            str.Close();
            // add results to the list
            xmlItems.Clear();
            foreach (XmlItem item in xmlItems)
            {
                xmlItems.Add(item);
            }
        };
        client.OpenReadAsync(new Uri(xmlUrl, UriKind.Absolute));
    }
}
I am basically trying to learn how to do this at the moment, as I am intrigued how to actually do it (I know there are many ways, but at the moment this way seems the easiest). I just don't get what the error is. (The part wrapped in *** is where it says the error is.)
I also know the display function is not great at the moment (it will only show the last item), but for testing this will do for now.
To some this may seem easy; as a learner, it's not so easy for me just yet.
The error in picture form:
(It seems I can't post images :/)
Thanks in advance for the help
Edit:
The answer below fixed the error :D
However, still nothing is coming up. I "think" it's because of the XML layout and the number of descendants it has. (I can't work out what I need to do, being a noob at XML and at pulling it from the web as a data source.)
Maybe I am starting with something too complicated :/
Still, any help/tips on how to correctly pull some elements from the feed (as they're all in descendants) and store them would be great :D
Edit2:
I have it working (in a crude way), but still :D
Thanks Adam Maras!
The last issue was the double listing: adding items to a list, then adding that list to another list, was causing a null exception. Just using one list within the method solved the issue (probably not the best way of doing it, but it works for now) and allowed me to add the results to a ListBox until I spend some time working out how to use ListBox.ItemTemplate and DataTemplate to make it look more appealing. (Seems easy enough, I say now...)
Thanks Again!!!
from item in xdoc.Descendants("situation id")
//                                      ^
XML tag names can't contain spaces. Looking at the XML, you probably just want "situation" to match the <situation> elements.
After looking at your edit and further reviewing the XML, I figured out what the problem is. If you look at the root element of the document:
<d2LogicalModel xmlns="http://datex2.eu/schema/1_0/1_0" modelBaseVersion="1.0">
You'll see that it has a default namespace applied. The easiest solution to your problem is to first get the namespace from the root element:
var ns = xdoc.Root.Name.Namespace;
And then apply it wherever you're using a string to identify an element or attribute name:
from item in xdoc.Descendants(ns + "situation")
// ...
item.Element(ns + "impactOnTraffic").Value
item.Element(ns + "trafficRestrictionType").Value
One more thing: <impactOnTraffic> and <trafficRestrictionType> aren't direct children of the <situation> element, so you'll need to change that code as well:
Title = item.Descendants(ns + "impactOnTraffic").Single().Value,
Description = item.Descendants(ns + "trafficRestrictionType").Single().Value
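Putting both fixes together, the corrected query from the question would look like this:

var ns = xdoc.Root.Name.Namespace;

xmlItems = (from item in xdoc.Descendants(ns + "situation")
            select new XmlItem()
            {
                Title = item.Descendants(ns + "impactOnTraffic").Single().Value,
                Description = item.Descendants(ns + "trafficRestrictionType").Single().Value
            }).ToList();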
I'm trying to get XML text from a webpage that is already opened in IE. Web requests are not allowed because of the security of the target page (a long boring story with certificates, etc.). I use a method that walks through all opened pages and, if I find a match with the page's URI, I need to get its XML.
Some time ago I needed to get the HTML code between the body tags. I used a method with IHTMLDocument2, like this:
private string GetSourceHTML()
{
    Regex reg = new Regex(patternURL);
    Match match;
    string result;
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        match = reg.Match(ie.LocationURL.ToString());
        if (!string.IsNullOrEmpty(match.Value))
        {
            mshtml.IHTMLDocument2 doc = (mshtml.IHTMLDocument2)ie.Document;
            result = doc.body.innerHTML.ToString();
            return result;
        }
    }
    result = string.Empty;
    return result;
}
So now I need to get the whole XML code of the target page. I've googled a lot but didn't find anything useful. Any ideas? Thanks.
Have you tried this? It should get the HTML, which hopefully you could parse into XML:
Retrieving the HTML source code
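If that works, adapting the method above may be as simple as swapping the body.innerHTML call for the root element's outerHTML (an untested sketch using IHTMLDocument3):

mshtml.IHTMLDocument3 doc3 = (mshtml.IHTMLDocument3)ie.Document;
// documentElement is the root <html> node; outerHTML returns the full markup.
result = doc3.documentElement.outerHTML;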
I'm trying to allow users to post videos on my site by supplying only the URL. Right now I'm able to support YouTube videos by parsing the URL to obtain the ID, then inserting that ID into YouTube's embed code and putting that on the page.
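For reference, that ID extraction is usually a small regex over the URL (this pattern is a simplification; real YouTube URLs come in more shapes than this):

using System.Text.RegularExpressions;

// Matches e.g. http://www.youtube.com/watch?v=DLME0PsJRnk and http://youtu.be/DLME0PsJRnk
Match m = Regex.Match(url, @"(?:v=|youtu\.be/)([A-Za-z0-9_-]{11})");
string videoId = m.Success ? m.Groups[1].Value : null;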
This limits me to YouTube videos, however. What I'm looking to do is something similar to Facebook, where you can put in the YouTube "Share" URL, the URL of the video's page directly, or any other video URL, and it loads the video into their player.
Any idea how they do this? Or any other comparable way to show a video based on just a URL? Keep in mind that YouTube doesn't give you the video file's URL, only the URL of the video's page on YouTube (which is why their embed code is needed, with just the ID).
Hopefully this made sense, and I hope somebody can offer me some advice on where to look!
Thanks guys.
I would suggest adding support for OpenGraph attributes, which are common among content services that want to let other sites embed their content. The information will be contained in the pages' <meta> tags, which means you would have to load the URL via something like the HtmlAgilityPack:
var doc = new HtmlDocument();
doc.Load(webClient.OpenRead(url)); // not exactly production quality

var openGraph = new Dictionary<string, string>();
foreach (var meta in doc.DocumentNode.SelectNodes("//meta"))
{
    var property = meta.Attributes["property"];
    var content = meta.Attributes["content"];
    if (property != null && property.Value.StartsWith("og:"))
    {
        openGraph[property.Value]
            = content != null ? content.Value : String.Empty;
    }
}
// Supported by: YouTube, Vimeo, CollegeHumor, etc.
if (openGraph.ContainsKey("og:video"))
{
    // 1. Get the MIME type
    string mime;
    if (!openGraph.TryGetValue("og:video:type", out mime))
    {
        mime = "application/x-shockwave-flash"; // should error
    }

    // 2. Get width/height
    string _w, _h;
    if (!openGraph.TryGetValue("og:video:width", out _w)
        || !openGraph.TryGetValue("og:video:height", out _h))
    {
        _w = _h = "300"; // probably an error :)
    }
    int w = Int32.Parse(_w), h = Int32.Parse(_h);

    Console.WriteLine(
        "<embed src=\"{0}\" type=\"{1}\" width=\"{2}\" height=\"{3}\" />",
        openGraph["og:video"],
        mime,
        w,
        h);
}