A C# parser for Web Links (RFC 5988)

Has anyone created an open source C# parser for the Web Links HTTP "Link" header?
See: https://www.rfc-editor.org/rfc/rfc5988.
Example:
Link: <http://example.com/TheBook/chapter2>; rel="previous"; title="previous chapter"
Thanks.
Update: Ended up creating my own parser: https://github.com/JornWildt/Ramone/blob/master/Ramone/Utility/WebLinkParser.cs. Feel free to use it.

Here's an extension method I've used:
public static Dictionary<string, string> ParseLinksHeader(
    this HttpResponseMessage response)
{
    var links = new Dictionary<string, string>();
    response.Headers.TryGetValues("link", out var headers);
    if (headers == null) return links;

    var matches = Regex.Matches(
        headers.First(),
        @"<(?<url>[^>]*)>;\s+rel=""(?<link>\w+)""");

    foreach (Match m in matches)
        links.Add(m.Groups["link"].Value, m.Groups["url"].Value);

    return links;
}
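For example, following a "next" link from a paged API might look like this (a minimal sketch inside an async method; the endpoint URL is just a placeholder, not part of the original answer):
using var client = new HttpClient();
var response = await client.GetAsync("https://api.example.com/items"); // placeholder URL
var links = response.ParseLinksHeader();
if (links.TryGetValue("next", out var nextUrl))
    Console.WriteLine($"Next page: {nextUrl}");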

Take the HTML Agility Pack and use the right SelectNodes query.
using System;
using HtmlAgilityPack;

namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            string url = "https://stackoverflow.com/"; // any page you want to scrape
            HtmlWeb web = new HtmlWeb();
            HtmlDocument doc = web.Load(url);

            // Select every anchor that has an href attribute.
            foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
            {
                Console.WriteLine(link.GetAttributeValue("href", string.Empty));
            }
        }
    }
}

How can I extract data from this site using HTMLAgilityPack?

I've been following tutorials on how to scrape information using HTMLAgilityPack; here is an example:
using System;
using System.Linq;
using System.Net;

namespace web_scraping_test
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
            HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.yellowpages.com/search?search_terms=Software&geo_location_terms=Sydney%2C+ND");
            var names = doc.DocumentNode.SelectNodes("//a[@class='business-name']").ToList();
            foreach (var item in names)
            {
                Console.WriteLine(item.InnerText);
            }
        }
    }
}
This was easy because the elements share a common class name, so the data is simple to get to.
I'm trying to use the same approach to scrape information from this site, https://osu.ppy.sh/beatmapsets/354163#osu/780200,
but I have no idea about the correct markup to get 'Stitches' by 'Shawn Mendes' and the other values shown on the beatmap page.
For 'Shawn Mendes' the markup is '<a class="beatmapset-header__details-text beatmapset-header__details-text--artist" href="https://osu.ppy.sh/beatmapsets?q=Shawn%20Mendes">Shawn Mendes</a>',
but I'm not sure how to work this into the code. I've replaced the URL and changed the class name, but the structure around this text seems a lot more complicated on this site. Any advice would be appreciated, thanks!
All of the details you're looking for appear to be in a JSON object in the markup. There is a script block with the ID "json-beatmapset"; if you scrape the content of that and parse the JSON it contains, it should be smooth sailing after that.
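For example, a minimal sketch using HtmlAgilityPack and Json.NET (the "artist" and "title" property names are assumptions about the page's JSON payload, so inspect the actual script content to confirm):
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

var web = new HtmlWeb();
var doc = web.Load("https://osu.ppy.sh/beatmapsets/354163");

// The page embeds its data as JSON in <script id="json-beatmapset">.
var script = doc.DocumentNode.SelectSingleNode("//script[@id='json-beatmapset']");
var json = JObject.Parse(script.InnerText);

// "artist" and "title" are assumed property names; check the JSON to confirm.
Console.WriteLine(json["artist"]);
Console.WriteLine(json["title"]);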

GetSafeHtmlFragment removing all html tags

I am using GetSafeHtmlFragment in my website and found that all tags except <p> and <a> are removed.
I researched around and found that there is no resolution for it from Microsoft.
Is there any replacement for it, or any other solution?
Thanks.
Amazing that Microsoft, in the 4.2.1 version, terribly overcompensated for a security leak in the 4.2 XSS library and still hasn't updated it a year later. The GetSafeHtmlFragment method should have been renamed to StripHtml, as I read someone comment somewhere.
I ended up using the HtmlSanitizer library suggested in this related SO issue. I liked that it was available as a package through NuGet.
This library basically implements a variation of the white-list approach the now accepted answer uses. However, it is based on CsQuery instead of the HTML Agility library. The package also gives some additional options, like being able to keep style information (e.g. HTML attributes). Using this library resulted in code in my project something like below, which, at least, is a lot less code than the accepted answer :).
using Html;
...
var sanitizer = new HtmlSanitizer();
sanitizer.AllowedTags = new List<string> { "p", "ul", "li", "ol", "br" };
string sanitizedHtml = sanitizer.Sanitize(htmlString);
An alternative solution would be to use the Html Agility Pack in conjunction with your own tag whitelist:
using System;
using System.IO;
using System.Text;
using System.Linq;
using System.Collections.Generic;
using HtmlAgilityPack;
class Program
{
    static void Main(string[] args)
    {
        var whiteList = new[]
        {
            "#comment", "html", "head",
            "title", "body", "img", "p",
            "a"
        };
        var html = File.ReadAllText("input.html");
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodesToRemove = new List<HtmlAgilityPack.HtmlNode>();
        var e = doc
            .CreateNavigator()
            .SelectDescendants(System.Xml.XPath.XPathNodeType.All, false)
            .GetEnumerator();
        while (e.MoveNext())
        {
            var node =
                ((HtmlAgilityPack.HtmlNodeNavigator)e.Current)
                .CurrentNode;
            if (!whiteList.Contains(node.Name))
            {
                nodesToRemove.Add(node);
            }
        }
        nodesToRemove.ForEach(node => node.Remove());
        var sb = new StringBuilder();
        using (var w = new StringWriter(sb))
        {
            doc.Save(w);
        }
        Console.WriteLine(sb.ToString());
    }
}

Troubles with HtmlAgilityPack

I can't figure out what's going wrong. I just created a project to test HtmlAgilityPack, and here's what I've got:
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

namespace parseHabra
{
    class Program
    {
        static void Main(string[] args)
        {
            HTTP net = new HTTP(); // some HTTP wrapper
            string result = net.MakeRequest("http://stackoverflow.com/", null);
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(result);
            // Get all summary blocks
            HtmlNodeCollection news = doc.DocumentNode.SelectNodes("//div[@class=\"summary\"]");
            foreach (HtmlNode item in news)
            {
                string title = String.Empty;
                // Trouble is here: for each element item I get the same value
                // all the time
                title = item.SelectSingleNode("//a[@class=\"question-hyperlink\"]").InnerText.Trim();
                Console.WriteLine(title);
            }
            Console.ReadLine();
        }
    }
}
It looks like the XPath is applied not to each node I've selected but to the whole document. Any suggestions as to why that is? Thanks in advance.
I have not tried your code, but from a quick look I suspect the problem is that the // is searching from the root of the entire document, and not from the root of the current element as I guess you are expecting.
Try putting a . before the //:
".//a[@class=\"question-hyperlink\"]"
I'd rewrite your xpath as a single query to find all the question titles, rather than finding the summaries then the titles. Chris' answer points out the problem which could have easily been avoided.
var web = new HtmlWeb();
var doc = web.Load("http://stackoverflow.com");
var xpath = "//div[starts-with(@id,'question-summary-')]//a[@class='question-hyperlink']";
var questionTitles = doc.DocumentNode
    .SelectNodes(xpath)
    .Select(a => a.InnerText.Trim());

C# Getting The HTML of the Links(Content) from Website

What I want is to open a link from a website (from its HTML content) and get the HTML of the newly opened site.
Example: I have www.google.com, and now I want to find all links.
For each link I want the HTML content of the new site.
I do something like this:
foreach (String link in GetLinksFromWebsite(htmlContent))
{
    using (var client = new WebClient())
    {
        htmlContent = client.DownloadString("http://" + link);
    }
    // istBildURL / bildLinks (image URL matches / image links) are defined elsewhere
    foreach (Match treffer in istBildURL)
    {
        string bildUrl = treffer.Groups[1].Value;
        bildLinks.Add(bildUrl);
    }
}

public static List<String> GetLinksFromWebsite(string htmlSource)
{
    string linkPattern = "<a.*?>(.*?)</a>";
    MatchCollection linkMatches = Regex.Matches(htmlSource, linkPattern, RegexOptions.Singleline);
    List<string> linkContents = new List<string>();
    foreach (Match match in linkMatches)
    {
        linkContents.Add(match.Value);
    }
    return linkContents;
}
The other problem is that I only get links, not LinkButtons (ASP.NET).
How can I solve this problem?
Steps to follow:
Download Html Agility Pack
Reference the assembly you have downloaded in your project
Throw out everything from your project that starts with the word "regex" or "regular expression" and deals with parsing HTML (read this answer to better understand why). In your case this would be the contents of the GetLinksFromWebsite method.
Replace what you have thrown away with a simple call to the Html Agility Pack parser.
Here's an example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            var htmlSource = client.DownloadString("http://www.stackoverflow.com");
            foreach (var item in GetLinksFromWebsite(htmlSource))
            {
                // TODO: you could easily write a recursive function
                // that will call itself here and retrieve the respective contents
                // of the site ...
                Console.WriteLine(item);
            }
        }
    }

    public static List<String> GetLinksFromWebsite(string htmlSource)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlSource);
        return doc
            .DocumentNode
            .SelectNodes("//a[@href]")
            .Select(node => node.Attributes["href"].Value)
            .ToList();
    }
}
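One caveat worth noting: SelectNodes returns null when nothing matches, so a page without anchors would throw a NullReferenceException. A defensive variant of the method might look like this:
public static List<String> GetLinksFromWebsite(string htmlSource)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(htmlSource);
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null)
        return new List<string>(); // no links on the page
    return anchors
        .Select(node => node.Attributes["href"].Value)
        .ToList();
}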

How to clean up poorly formed HTML using HTML Agility Pack

I am attempting to replace a god-awful collection of regular expressions that is currently used to clean up blocks of poorly formed HTML, and I stumbled upon the HTML Agility Pack for C#. It looks very powerful, yet I couldn't find an example of the way I want to use the pack, which, in my mind, would be a desired piece of functionality. I am sure I am an idiot and simply cannot find a suitable method in the documentation.
Let me explain... say I had the following html:
<p class="someclass">
<font size="3">
<font face="Times New Roman">
this is some text
Some link
</font>
</font>
</p>
... that I want to look like:
<p>
this is some text
Some link
</p>
When I utilize the HtmlNode.Remove() method, it removes the node plus all its children. Is there a way to remove the node while preserving the children?
On HtmlNode, the method RemoveChild has this overload:
public HtmlNode RemoveChild(HtmlNode oldChild, bool keepGrandChildren);
So this is how you would do it:
HtmlDocument doc = new HtmlDocument();
doc.Load("yourfile.htm");
foreach (HtmlNode font in doc.DocumentNode.SelectNodes("//font"))
{
    font.ParentNode.RemoveChild(font, true);
}
EDIT: It looks like RemoveChild with the keepGrandChildren option is not working as expected, so here is an alternate implementation:
public static HtmlNode RemoveChild(HtmlNode parent, HtmlNode oldChild, bool keepGrandChildren)
{
    if (oldChild == null)
        throw new ArgumentNullException("oldChild");

    if (oldChild.HasChildNodes && keepGrandChildren)
    {
        HtmlNode prev = oldChild.PreviousSibling;
        List<HtmlNode> nodes = new List<HtmlNode>(oldChild.ChildNodes.Cast<HtmlNode>());
        // Sort descending by stream position so repeated InsertAfter(prev) calls
        // end up preserving the original document order.
        nodes.Sort(new StreamPositionComparer());
        foreach (HtmlNode grandchild in nodes)
        {
            parent.InsertAfter(grandchild, prev);
        }
    }
    parent.RemoveChild(oldChild);
    return oldChild;
}

// this helper class allows nodes to be sorted by their position in the file.
private class StreamPositionComparer : IComparer<HtmlNode>
{
    int IComparer<HtmlNode>.Compare(HtmlNode x, HtmlNode y)
    {
        return y.StreamPosition.CompareTo(x.StreamPosition);
    }
}
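With that helper in place (assuming it is accessible from the calling code), the earlier loop becomes:
foreach (HtmlNode font in doc.DocumentNode.SelectNodes("//font"))
{
    RemoveChild(font.ParentNode, font, true);
}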
You could try using AngleSharp instead.
var parser = new HtmlParser();
var document = parser.Parse(html);
using (var writer = new StringWriter())
{
    document.ToHtml(writer, new PrettyMarkupFormatter());
    return writer.ToString();
}
Once you find the element, use the InnerText property to get the text, then remove the node and insert the text.
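A rough sketch of that idea with the Html Agility Pack (note that, unlike the answers above, this flattens everything to plain text and drops child elements such as links):
foreach (HtmlNode font in doc.DocumentNode.SelectNodes("//font"))
{
    // Replace the element with a text node holding its inner text.
    var text = doc.CreateTextNode(font.InnerText);
    font.ParentNode.ReplaceChild(text, font);
}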
