Troubles with HtmlAgilityPack

Troubles with HtmlAgilityPack - c#

I can't figure out what goes wrong. i just create the poject to test HtmlAgilityPack and what i've got.
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;
namespace parseHabra
{
class Program
{
static void Main(string[] args)
{
HTTP net = new HTTP(); //some http wraper
string result = net.MakeRequest("http://stackoverflow.com/", null);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
//Get all summary blocks
HtmlNodeCollection news = doc.DocumentNode.SelectNodes("//div[#class=\"summary\"]");
foreach (HtmlNode item in news)
{
string title = String.Empty;
//trouble is here for each element item i get the same value
//all the time
title = item.SelectSingleNode("//a[#class=\"question-hyperlink\"]").InnerText.Trim();
Console.WriteLine(title);
}
Console.ReadLine();
}
}
}
It looks like i make xpath not for each node i've selected but to whole document. Any suggestions why it so ? Thx in advance.

I have not tried your code, but from the quick look I suspect the problem is that the // is searching from the root of the entire document and not the root of the current element as I guess you are expecting.
Try putting a . before the //
".//a[#class=\"question-hyperlink\"]"

I'd rewrite your xpath as a single query to find all the question titles, rather than finding the summaries then the titles. Chris' answer points out the problem which could have easily been avoided.
var web = new HtmlWeb();
var doc = web.Load("http://stackoverflow.com");
var xpath = "//div[starts-with(#id,'question-summary-')]//a[#class='question-hyperlink']";
var questionTitles = doc.DocumentNode
.SelectNodes(xpath)
.Select(a => a.InnerText.Trim());

Related

XML dialogue tree in Unity not parsing information correctly

I have tried to setup a Dialogue tree within unity using XML (I have not used XML much before so am unsure if the way i am going is correct at all)
So I am trying to get the first text element from this dialogue tree but when i call the XML file and say where it is i am getting the everything stored in that branch.
Am i using the correct .XML to be able to do this also as i seen people say use .XML.LINQ or .XML.Serialization not just .XML is this correct for my case ??
Code:
using UnityEngine;
using System.Collections;
using System.IO;
using System.Xml;
using UnityEngine.UI;
using System.Collections.Generic;
public class DialogTree
{
public string text;
public List<string> dialogText;
public List<DialogTree> nodes;
public void parseXML(string xmlData)
{
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(new StringReader(xmlData));
XmlNode node = xmlDoc.SelectSingleNode("dialoguetree/dialoguebranch");
text = node.InnerXml;
XmlNodeList myNodeList = xmlDoc.SelectNodes("dialoguebranch/dialoguebranch");
foreach (XmlNode node1 in myNodeList)
{
if (node1.InnerXml.Length > 0)
{
DialogTree dialogtreenode = new DialogTree();
dialogtreenode.parseXML(node1.InnerXml);
nodes.Add(dialogtreenode);
}
}
}
}
And here is a picture of the XML.
So i am trying to grab the first element of text then late on there response it will go to branch 1 or 2
<?xml version='1.0'?>
<dialoguetree>
<dialoguebranch>
<text>Testing if the test prints</text>
<dialoguebranch>
<text>Branch 1</text>
<dialoguebranch>
<text>Branch 1a</text>
</dialoguebranch>
<dialoguebranch>
<text>Branch 1b</text>
</dialoguebranch>
</dialoguebranch>
<dialoguebranch>
<text>Branch 2</text>
</dialoguebranch>
</dialoguebranch>
</dialoguetree>

You're getting everything in that branch because XmlNode.InnerXML returns everything in that node. See the documentation for more information on that.
You should use the branch as the base for only looking at its children, instead of starting at xmlDoc every time. Also, you need an entry point to get inside of the first dialoguetree element and then ignore that. Finally, I would only create one XmlDocument and just pass around nodes in your recursion.
Altogether, this might look like this:
public class DialogTree
{
public string text;
public List<DialogTree> nodes = new List<DialogTree>();
public static DialogTree ParseXMLStart(string xmlData)
{
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load(new Stringreader(xmlData));
XmlNode rootNode = xmlDoc.SelectSingleNode("dialoguetree/dialoguebranch");
DialogTree dialogTree = new DialogTree();
dialogTree.ParseXML(rootNode);
return dialogTree;
}
public void ParseXML(XmlNode parentNode)
{
XmlNode textNode = parentNode.SelectSingleNode("text");
text = textNode.InnerText;
XmlNodeList myNodeList = parentNode.SelectNodes("dialoguebranch");
foreach (XmlNode curNode in myNodeList)
{
if (curNode.InnerXml.Length > 0)
{
DialogTree dialogTree = new DialogTree();
dialogTree.ParseXML(curNode);
nodes.Add(dialogTree);
}
}
}
}
And you could use it like so:
string xmlStringFromFile;
DialogTree dialogue = DialogTree.ParseXMLStart(xmlStringFromFile);
All of this code is untested but I hope the general idea is clear. Let me know if you find any errors in the comments below and I will try to fix them.

How can I extract data from this site using HTMLAgilityPack?

I've been following tutorials on how to scrape information using HTMLAgilityPack, here is an example:
using System;
using System.Linq;
using System.Net;
namespace web_scraping_test
{
class Program
{
static void Main(string[] args)
{
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.yellowpages.com/search?search_terms=Software&geo_location_terms=Sydney2C+ND");
var names = doc.DocumentNode.SelectNodes("//a[#class='business-name']").ToList();
foreach (var item in names)
{
Console.WriteLine(item.InnerText);
}
}
}
}
This was easy to get the data because there's a common class name and it's simple to get to
I'm trying to use this to scrape information from this site, https://osu.ppy.sh/beatmapsets/354163#osu/780200
but I have no idea about the correct markup to get 'Stitches
Shawn Mendes' and the values given in this diagram:Diagram
For the 'Shawn Mendes' the markup is '<a class="beatmapset-header__details-text beatmapset-header__details-text--artist" href="https://osu.ppy.sh/beatmapsets?q=Shawn%20Mendes">Shawn Mendes</a>'
but I'm not sure about how to implement this into the code. I've replaced the url and have changed the classname but the directory of this text seems a lot more complicated on this site. Any advice would be appreciated, thanks!

All of the details you're looking for appear to be in a JSON object in the markup. There is a script block with the ID "json-beatmapset", if you scrape the content of that, and parse the JSON it contains, it should be smooth sailing after that.

GetSafeHtmlFragment removing all html tags

I am using GetSafeHtmlFragment in my website and I found that all of tags except <p> and <a> is removed.
I researched around and I found that there is no resolution for it from Microsoft.
Is there any superseded for it or is there any solution?
Thanks.

Amazing that Microsoft in the 4.2.1 version terribly overcompensated for a security leak in the 4.2 XSS library and now still hasn't updated a year later. The GetSafeHtmlFragment method should have been renamed to StripHtml as I read someone commenting somewhere.
I ended up using the HtmlSanitizer library suggested in this related SO issue. I liked that it was available as a package through NuGet.
This library basically implements a variation of the white-list approach the now accepted answer uses. However it is based on CsQuery instead of the HTML Agility library. The package also gives some additional options, like being able to keep style information (e.g. HTML attributes). Using this library resulted in code in my project something like below, which - at least - is a lot less code than the accepted answer :).
using Html;
...
var sanitizer = new HtmlSanitizer();
sanitizer.AllowedTags = new List<string> { "p", "ul", "li", "ol", "br" };
string sanitizedHtml = sanitizer.Sanitize(htmlString);

An alternative solution would be to use the Html Agility Pack in conjunction with your own tags white list :
using System;
using System.IO;
using System.Text;
using System.Linq;
using System.Collections.Generic;
using HtmlAgilityPack;
class Program
{
static void Main(string[] args)
{
var whiteList = new[]
{
"#comment", "html", "head",
"title", "body", "img", "p",
"a"
};
var html = File.ReadAllText("input.html");
var doc = new HtmlDocument();
doc.LoadHtml(html);
var nodesToRemove = new List<HtmlAgilityPack.HtmlNode>();
var e = doc
.CreateNavigator()
.SelectDescendants(System.Xml.XPath.XPathNodeType.All, false)
.GetEnumerator();
while (e.MoveNext())
{
var node =
((HtmlAgilityPack.HtmlNodeNavigator)e.Current)
.CurrentNode;
if (!whiteList.Contains(node.Name))
{
nodesToRemove.Add(node);
}
}
nodesToRemove.ForEach(node => node.Remove());
var sb = new StringBuilder();
using (var w = new StringWriter(sb))
{
doc.Save(w);
}
Console.WriteLine(sb.ToString());
}
}

A C# parser for Web Links (RFC 5988)

Anyone created an open source C# parser for Web Links HTTP "Link" header?
See: https://www.rfc-editor.org/rfc/rfc5988.
Example:
Link: <http://example.com/TheBook/chapter2>; rel="previous"; title="previous chapter"
Thanks.
Update: Ended up creating my own parser: https://github.com/JornWildt/Ramone/blob/master/Ramone/Utility/WebLinkParser.cs. Feel free to use it.

Ended up creating my own parser: https://github.com/JornWildt/Ramone/blob/master/Ramone/Utility/WebLinkParser.cs. Feel free to use it.

Here's an extension method I've used:
public static Dictionary<string, string> ParseLinksHeader(
this HttpResponseMessage response)
{
var links = new Dictionary<string, string>();
response.Headers.TryGetValues("link", out var headers);
if (headers == null) return links;
var matches = Regex.Matches(
headers.First(),
#"<(?<url>[^>]*)>;\s+rel=""(?<link>\w+)\""");
foreach(Match m in matches)
links.Add(m.Groups["link"].Value, m.Groups["url"].Value);
return links;
}

Take the HTML Agility Pack and use the right
SelectNodes
query.
using HtmlAgilityPack;
namespace WebScraper
{
class Program
{
static void Main(string[] args)
{
HtmlWeb web = new HtmlWeb();
HtmlDocument doc =web.Load(url);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[#Link]"))
{
}

C# Getting The HTML of the Links(Content) from Website

What I want is, opening a Link from a Website (from HtmlContent)
and get the Html of this new opened site..
Example: I have www.google.com, now I want to find all Links.
For each Link I want to have the HTMLContent of the new Site.
I do something like this:
foreach (String link in GetLinksFromWebsite(htmlContent))
{
using (var client = new WebClient())
{
htmlContent = client.DownloadString("http://" + link);
}
foreach (Match treffer in istBildURL)
{
string bildUrl = treffer.Groups[1].Value;
bildLinks.Add(bildUrl);
}
}
public static List<String> GetLinksFromWebsite(string htmlSource)
{
string linkPattern = "(.*?)";
MatchCollection linkMatches = Regex.Matches(htmlSource, linkPattern, RegexOptions.Singleline);
List<string> linkContents = new List<string>();
foreach (Match match in linkMatches)
{
linkContents.Add(match.Value);
}
return linkContents;
}
The other problem is, that I only get Links, not Linkbuttons (ASP.NET)..
How can I solve the problem?

Steps to follow:
Download Html Agility Pack
Reference the assembly you have downloaded in your project
Throw everything that starts with the word regex or regular expression out from your project and which deals with parsing HTML (read this answer to better understand why). In your case this would be the contents of the GetLinksFromWebsite method.
Replace what you have thrown away with a simple call to the Html Agility Pack parser.
Here's an example:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;
class Program
{
static void Main()
{
using (var client = new WebClient())
{
var htmlSource = client.DownloadString("http://www.stackoverflow.com");
foreach (var item in GetLinksFromWebsite(htmlSource))
{
// TODO: you could easily write a recursive function
// that will call itself here and retrieve the respective contents
// of the site ...
Console.WriteLine(item);
}
}
}
public static List<String> GetLinksFromWebsite(string htmlSource)
{
var doc = new HtmlDocument();
doc.LoadHtml(htmlSource);
return doc
.DocumentNode
.SelectNodes("//a[#href]")
.Select(node => node.Attributes["href"].Value)
.ToList();
}
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Troubles with HtmlAgilityPack - c#

I have not tried your code, but from the quick look I suspect the problem is that the // is searching from the root of the entire document and not the root of the current element as I guess you are expecting. Try putting a . before the // ".//a[#class=\"question-hyperlink\"]"

Related

XML dialogue tree in Unity not parsing information correctly

How can I extract data from this site using HTMLAgilityPack?

GetSafeHtmlFragment removing all html tags

A C# parser for Web Links (RFC 5988)

C# Getting The HTML of the Links(Content) from Website

Categories

Resources