I'm trying to allow users to post videos on my site by supplying only the URL. Right now I'm able to allow YouTube videos by just parsing the URL and obtaining the ID, and then inserting that ID into their given "embed" code and putting that on the page.
This limits me to YouTube videos only, however. What I'm looking to do is something similar to Facebook, where you can put in the YouTube "Share" URL, or the URL of the page directly, or any other video URL, and it loads the video into their player.
Any idea how they do this? Or is there any other comparable way to show a video based on just a URL? Keep in mind that YouTube videos (which would probably be the most popular anyway) don't give you the video file's URL, but the URL of the video's page on YouTube (which is why their embed code with just the ID is needed).
Hopefully this made sense, and I hope somebody might be able to offer me some advice on where to look!
Thanks guys.
I would suggest adding support for Open Graph attributes, which are common among content services that want to let other sites embed their content. The information is contained in the pages' <meta> tags, which means you would have to load the URL via something like the HtmlAgilityPack:
// assumes: using System; using System.Net; using System.Collections.Generic; using HtmlAgilityPack;
var webClient = new WebClient();
var doc = new HtmlDocument();
doc.Load(webClient.OpenRead(url)); // not exactly production quality

var openGraph = new Dictionary<string, string>();
foreach (var meta in doc.DocumentNode.SelectNodes("//meta"))
{
    var property = meta.Attributes["property"];
    var content = meta.Attributes["content"];
    if (property != null && property.Value.StartsWith("og:"))
    {
        openGraph[property.Value]
            = content != null ? content.Value : String.Empty;
    }
}

// Supported by: YouTube, Vimeo, CollegeHumor, etc.
if (openGraph.ContainsKey("og:video"))
{
    // 1. Get the MIME type
    string mime;
    if (!openGraph.TryGetValue("og:video:type", out mime))
    {
        mime = "application/x-shockwave-flash"; // should error
    }

    // 2. Get width/height
    string _w, _h;
    if (!openGraph.TryGetValue("og:video:width", out _w)
        || !openGraph.TryGetValue("og:video:height", out _h))
    {
        _w = _h = "300"; // probably an error :)
    }
    int w = Int32.Parse(_w), h = Int32.Parse(_h);

    Console.WriteLine(
        "<embed src=\"{0}\" type=\"{1}\" width=\"{2}\" height=\"{3}\" />",
        openGraph["og:video"],
        mime,
        w,
        h);
}
I would like to know the page views for a certain page/url only.
Is there a way to add the url to the Google query? Baseurl doesn't seem to work.
DataQuery PageViewQuery = new DataQuery(DataFeedUrl)
{
    Ids = ProfileID,
    Dimensions = "ga:date",
    Metrics = "ga:pageviews",
    Sort = "ga:date",
    GAStartDate = DateTime.Now.AddDays(-7).ToString("yyyy-MM-dd"),
    GAEndDate = DateTime.Now.ToString("yyyy-MM-dd")
};
You might try ga:pagePath (plus ga:hostname if you need to retrieve a full URL), or if you have unique page titles ga:pageTitle should be a valid workaround.
Full documentation of dimensions and metrics is here: https://developers.google.com/analytics/devguides/reporting/core/dimsmets.
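For example, a rough sketch reusing the DataQuery from the question, assuming the older Google.GData.Analytics client where DataQuery exposes a Filters property (the page path is a placeholder):
DataQuery pageViewQuery = new DataQuery(DataFeedUrl)
{
    Ids = ProfileID,
    Dimensions = "ga:pagePath",           // add ga:hostname if you need the full URL
    Metrics = "ga:pageviews",
    Filters = "ga:pagePath==/your/page",  // restrict the report to a single page
    Sort = "-ga:pageviews",
    GAStartDate = DateTime.Now.AddDays(-7).ToString("yyyy-MM-dd"),
    GAEndDate = DateTime.Now.ToString("yyyy-MM-dd")
};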
This is the code to get the links:
private List<string> getLinks(HtmlAgilityPack.HtmlDocument document)
{
    List<string> mainLinks = new List<string>();
    var linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes != null)
    {
        foreach (HtmlNode link in linkNodes)
        {
            var href = link.Attributes["href"].Value;
            mainLinks.Add(href);
        }
    }
    return mainLinks;
}
Sometimes the links I'm getting start with "/", or look like:
"/videos?feature=mh"
or
"//www.youtube.com/my_videos_upload"
I'm not sure whether just "/" means a proper site, or whether a link that starts with "/videos?..." or "//www.youtube..." does.
Each time I need to get the links from a website that start with http or https; maybe links starting with just www also count as a proper site. The question is what I should define as a proper site address/link and what not.
I'm sure my getLinks function is not good; the code is not written the way it should be.
This is how I'm adding the links to the list:
private List<string> test(string url, int levels, DoWorkEventArgs eve)
{
    HtmlAgilityPack.HtmlDocument doc;
    HtmlWeb hw = new HtmlWeb();
    List<string> webSites; // = new List<string>();
    List<string> csFiles = new List<string>();
    try
    {
        doc = hw.Load(url);
        webSites = getLinks(doc);
webSites is a List<string>.
After a few iterations I see entries in the list like "/", or, as above, "//videos..." or "//www....".
Not sure if I understood your question, but "/Videos" means it is accessing the Videos folder from the root of the host you are accessing, e.g.:
www.somesite.com/Videos
There are absolute and relative URLs, so you are getting different flavors from different links; you need to turn them into absolute URLs appropriately (the Uri class will mostly handle it for you; see the sketch below the list):
foo/bar.txt - relative URL from the same path as the current page
../foo/bar.txt - relative path from one folder above the current one
/foo/bar.txt - server-relative path - same server, path starting from the root
//www.sample.com/foo/bar.txt - absolute URL with the same scheme (http/https) as the current page
http://www.sample.com/foo/bar.txt - complete absolute URL
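For example, a minimal sketch using the Uri class (pageUrl is a placeholder for the URL you loaded the page from, href is the raw attribute value you scraped):
Uri baseUri = new Uri(pageUrl); // e.g. "http://www.youtube.com/somepage"
Uri absolute;
if (Uri.TryCreate(baseUri, href, out absolute))
{
    // "/videos?feature=mh"                 -> "http://www.youtube.com/videos?feature=mh"
    // "//www.youtube.com/my_videos_upload" -> "http://www.youtube.com/my_videos_upload"
    // "http://www.sample.com/foo/bar.txt"  -> unchanged
    Console.WriteLine(absolute.AbsoluteUri);
}
From there you can filter on absolute.Scheme == "http" || absolute.Scheme == "https" to keep only the links you consider proper sites.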
It looks like you are using a library that is able to parse/read HTML tags.
As I understand it,
var href = link.Attributes["href"].Value;
does nothing but read the value of the "href" attribute.
So, assuming the website's source code uses links like href="/news", it will grab and save even the relative links to your list.
Just view the target website's source code and check it against your results.
Using C#, I want to read the shares, comments, and likes of a Google+ post like this one: https://plus.google.com/107200121064812799857/posts/GkyGQPLi6KD
That post is an activity. This page includes information on how to get information about activities. This page gives some examples of using the API. This page has downloads for the Google API .NET library, which you can use to access the Google+ APIs, with XML documentation etc.
You'll need to use the API Console to get an API key and manage your API usage.
Also take a look at the API Explorer.
Here's a working example:
Referencing Google.Apis.dll and Google.Apis.Plus.v1.dll
PlusService plus = new PlusService();
plus.Key = "YOURAPIKEYGOESHERE";

ActivitiesResource ar = new ActivitiesResource(plus);
ActivitiesResource.Collection collection = new ActivitiesResource.Collection();

// 107... is the poster's id
ActivitiesResource.ListRequest list = ar.List("107200121064812799857", collection);
ActivityFeed feed = list.Fetch();

// You'll obviously want to use a _much_ better way to get
// the activity id, but you aren't normally searching for a
// specific URL like this.
string activityKey = "";
foreach (var a in feed.Items)
{
    if (a.Url == "https://plus.google.com/107200121064812799857/posts/GkyGQPLi6KD")
    {
        activityKey = a.Id;
        break;
    }
}

ActivitiesResource.GetRequest get = ar.Get(activityKey);
Activity act = get.Fetch();

Console.WriteLine("Title: " + act.Title);
Console.WriteLine("URL:" + act.Url);
Console.WriteLine("Published:" + act.Published);
Console.WriteLine("By:" + act.Actor.DisplayName);
Console.WriteLine("Annotation:" + act.Annotation);
Console.WriteLine("Content:" + act.Object.Content);
Console.WriteLine("Type:" + act.Object.ObjectType);
Console.WriteLine("# of +1s:" + act.Object.Plusoners.TotalItems);
Console.WriteLine("# of reshares:" + act.Object.Resharers.TotalItems);
Console.ReadLine();
Output:
Title: Wow Awesome creativity...!!!!!
URL:http://plus.google.com/107200121064812799857/posts/GkyGQPLi6KD
Published:2012-04-07T05:11:22.000Z
By:Funny Pictures & Videos
Annotation:
Content: Wow Awesome creativity...!!!!!
Type:note
# of +1s:210
# of reshares:158
I need to create an HTML parser that, given a blog URL, returns a list with all the posts on the page. I.e. if a page has 10 posts, it should return a list of 10 divs, where each div contains an h1 and a p.
I can't use the RSS feed, because I need to know exactly how the page looks to the user, including any ads, images, etc.; besides, some blogs show just a summary of the content on the page while the feed has it all, and vice versa.
Anyway, I've made one that downloads the feed and searches the HTML for similar content; it works very well for some blogs, but not for others.
I don't think I can make a parser that works for 100% of the blogs it parses, but I want to make the best one possible.
What would be the best approach? Look for tags whose id attribute equals "post" or "content"? Look for p tags? Etc., etc., etc.
Thanks in advance for any help!
I don't think you will be successful with that. You might be able to parse one blog, but if the blog engine changes things, it won't work any more. I also don't think you'll be able to write a generic parser. You might even be partially successful, but it's going to be an ephemeral success, because everything is so error-prone in this context. If you need content, you should go with RSS. If you need to store (simply store) how it looks, you can also do that. But parsing by the way it looks? I don't see concrete success in that.
"Best possible" turns out to be "best reasonable," and you get to define what is reasonable. You can get a very large number of blogs by looking at how common blogging tools (WordPress, LiveJournal, etc.) generate their pages, and code specially for each one.
The general case turns out to be a very hard problem because every blogging tool has its own format. You might be able to infer things using "standard" identifiers like "post", "content", etc., but it's doubtful.
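If you do try that heuristic, here is a rough sketch with the HTML Agility Pack (the id/class names here are only guesses, not a standard):
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
// look for containers whose id or class hints at a blog post
var candidates = doc.DocumentNode.SelectNodes(
    "//div[contains(@id,'post') or contains(@class,'post') or " +
    "contains(@id,'content') or contains(@class,'entry')]");
if (candidates != null)
{
    foreach (var div in candidates)
    {
        var title = div.SelectSingleNode(".//h1 | .//h2");
        var body = div.SelectSingleNode(".//p");
        // decide here whether this div really looks like a post
    }
}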
You'll also have difficulty with ads. A lot of ads are generated with JavaScript. So downloading the page will give you just the JavaScript code rather than the HTML that gets generated. If you really want to identify the ads, you'll have to identify the JavaScript code that generates them. Or, your program will have to execute the JavaScript to create the final DOM. And then you're faced with a problem similar to that above: figuring out if some particular bit of HTML is an ad.
There are heuristic methods that are somewhat successful. Check out Identifying a Page's Primary Content for answers to a similar question.
Use the HTML Agility pack. It is an HTML parser made for this.
I just did something like this for our company's blog, which uses WordPress. This works for us because our WordPress blog hasn't changed in years, but the others are right in that, if your HTML changes a lot, parsing becomes a cumbersome solution.
Here is what I recommend:
Using Nuget install RestSharp and HtmlAgilityPack. Then download fizzler and include those references in your project (http://code.google.com/p/fizzler/downloads/list).
Here is some sample code I used to implement the blog's search on my site.
using System;
using System.Collections.Generic;
using Fizzler.Systems.HtmlAgilityPack;
using RestSharp;
using RestSharp.Contrib;

namespace BlogSearch
{
    public class BlogSearcher
    {
        const string Site = "http://yourblog.com";

        public static List<SearchResult> Get(string searchTerms, int count = 10)
        {
            var searchResults = new List<SearchResult>();
            var client = new RestSharp.RestClient(Site);

            // note: 10 is the page size for the search results
            var pages = (int)Math.Ceiling((double)count / 10);

            for (int page = 1; page <= pages; page++)
            {
                var request = new RestSharp.RestRequest
                {
                    Method = Method.GET,
                    // the part after .com/
                    Resource = "page/" + page
                };
                // Your search params here
                request.AddParameter("s", HttpUtility.UrlEncode(searchTerms));

                var res = client.Execute(request);
                searchResults.AddRange(ParseHtml(res.Content));
            }

            return searchResults;
        }

        public static List<SearchResult> ParseHtml(string html)
        {
            var doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(html);
            var results = doc.DocumentNode.QuerySelectorAll("#content-main > div");

            var searchResults = new List<SearchResult>();
            foreach (var node in results)
            {
                bool add = false;
                var sr = new SearchResult();

                var a = node.QuerySelector(".posttitle > h2 > a");
                if (a != null)
                {
                    add = true;
                    sr.Title = a.InnerText;
                    sr.Link = a.Attributes["href"].Value;
                }

                var p = node.QuerySelector(".entry > p");
                if (p != null)
                {
                    add = true;
                    sr.Excerpt = p.InnerText;
                }

                if (add)
                    searchResults.Add(sr);
            }
            return searchResults;
        }
    }

    public class SearchResult
    {
        public string Title { get; set; }
        public string Link { get; set; }
        public string Excerpt { get; set; }
    }
}
Good luck,
Eric
How to screen scrape HTTPS using C#?
You can use System.Net.WebClient to start an HTTPS connection, and pull down the page to scrape with that.
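For example, a minimal sketch (the URL is a placeholder; some sites also refuse requests without a User-Agent header, as the example further down shows):
using System.Net;

var client = new WebClient();
client.Headers.Add("User-Agent", "Mozilla/5.0");
string html = client.DownloadString("https://example.com/some/page");
// html now holds the page source, ready to be parsed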
Look into the Html Agility Pack.
You can use System.Net.WebClient to grab web pages. Here is an example: http://www.codersource.net/csharp_screen_scraping.html
If for some reason you're having trouble accessing the page as a web client, or you want the request to look like it came from a browser, you could use the WebBrowser control in an app, load the page in it, and use the source of the loaded content from the control.
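A rough sketch of that approach with the WinForms WebBrowser control (illustrative only; the control needs to live in a form with a message loop, and the URL is a placeholder):
var browser = new System.Windows.Forms.WebBrowser();
browser.ScriptErrorsSuppressed = true;
browser.DocumentCompleted += (s, e) =>
{
    // fires once the page has finished loading in the control
    string source = browser.DocumentText;
    // parse 'source' here
};
browser.Navigate("https://example.com/some/page");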
Here's a concrete (albeit trivial) example. You can pass a ship name to VesselFinder in the querystring, but even if it only finds one ship with that name it still shows you the search results screen with one ship. This example detects that case and takes the user straight to the tracking map for the ship.
string strName = "SAFMARINE MAFADI";
string strURL = "https://www.vesselfinder.com/vessels?name=" + HttpUtility.UrlEncode(strName);
string strReturnURL = strURL;
string strToSearch = "/?imo=";
string strPage = string.Empty;
byte[] aReqtHTML;
WebClient objWebClient = new WebClient();
objWebClient.Headers.Add("User-Agent: Other"); //You must do this or HTTPS won't work
aReqtHTML = objWebClient.DownloadData(strURL); //Do the name search
UTF8Encoding utf8 = new UTF8Encoding();
strPage = utf8.GetString(aReqtHTML); // get the string from the bytes
if (strPage.IndexOf(strToSearch) != strPage.LastIndexOf(strToSearch))
{
    // more than one instance found, so leave the return URL as the name search
}
else if (strPage.Contains(strToSearch))
{
    // exactly one match: find the ship's IMO
    strPage = strPage.Substring(strPage.IndexOf(strToSearch)); // cut off the stuff before
    strPage = strPage.Substring(0, strPage.IndexOf("\""));     // cut off the stuff after
    strReturnURL = "https://www.vesselfinder.com" + strPage;   // go straight to the tracking map
}