I've been following tutorials on how to scrape information using HtmlAgilityPack; here is an example:
using System;
using System.Linq;
using System.Net;

namespace web_scraping_test
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
            HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.yellowpages.com/search?search_terms=Software&geo_location_terms=Sydney%2C+ND");
            var names = doc.DocumentNode.SelectNodes("//a[@class='business-name']").ToList();
            foreach (var item in names)
            {
                Console.WriteLine(item.InnerText);
            }
        }
    }
}
This was easy because there's a common class name and it's simple to get to.
I'm trying to use this to scrape information from this site, https://osu.ppy.sh/beatmapsets/354163#osu/780200, but I have no idea about the correct markup to get 'Stitches' and 'Shawn Mendes' and the values given in the diagram.
For the 'Shawn Mendes' the markup is '<a class="beatmapset-header__details-text beatmapset-header__details-text--artist" href="https://osu.ppy.sh/beatmapsets?q=Shawn%20Mendes">Shawn Mendes</a>'
but I'm not sure how to implement this in the code. I've replaced the URL and changed the class name, but the structure of this text seems a lot more complicated on this site. Any advice would be appreciated, thanks!
All of the details you're looking for appear to be in a JSON object in the markup. There is a script block with the ID "json-beatmapset", if you scrape the content of that, and parse the JSON it contains, it should be smooth sailing after that.
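For example, here is a minimal sketch of that approach (the "title" and "artist" property names inside the JSON are assumptions based on what the page displays, so verify them against the actual payload):

using System;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

class Program
{
    static void Main()
    {
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://osu.ppy.sh/beatmapsets/354163#osu/780200");

        // The page embeds its data as JSON in a script block with this id.
        HtmlNode script = doc.GetElementbyId("json-beatmapset");
        if (script == null)
        {
            Console.WriteLine("json-beatmapset block not found");
            return;
        }

        JObject beatmapset = JObject.Parse(script.InnerText);

        // Assumed property names; inspect the JSON to confirm them.
        Console.WriteLine(beatmapset["title"]);
        Console.WriteLine(beatmapset["artist"]);
    }
}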
I've scraped a website with HtmlAgilityPack in C#, and I'm trying to open all the links inside it and scrape them with the same method.
But when I call the method below, the page is downloaded by the library as if I had AdBlock active. In fact, I can't find any tables, and the downloaded HTML says "ADblock detected".
This is strange because I've filtered the oddsmath website in my Google Chrome and I can download the master page without any problem. Has anyone faced this problem?
This is the function; the "Console.WriteLine" is just for testing, to see the full HTML code.
public void GetMatchesDetails()
{
    List<String> matchDetails = new List<string>();
    foreach (Oddsmath om in oddsmathGoodMatches)
    {
        matchDetails.Add("http://www.oddsmath.com" + om.matchUrl);
    }
    foreach (String om in matchDetails)
    {
        HtmlDocument doc = new HtmlWeb().Load(om);
        foreach (HtmlNode table in doc.DocumentNode.SelectNodes("html"))
        {
            Console.WriteLine("Found: " + table.OuterHtml);
            foreach (HtmlNode row in table.SelectNodes("tr"))
            {
                Console.WriteLine("row");
                foreach (HtmlNode cell in row.SelectNodes("th|td"))
                {
                    Console.WriteLine("cell: " + cell.InnerText);
                }
            }
        }
    }
}
EDIT
Digging a little deeper, I've noticed that maybe it's not a problem with my application or something related to AdBlock, but seems connected to the website I'm trying to scrape... In fact, if you open a page like this: oddsmath.com/football/international/afc-champions-league-1053/… you can see that the content loads correctly in the browser, but the tables are empty in the page source. Why? Is it JavaScript that prevents the page from loading?
First: Use whatever you are most comfortable with, HAP vs AngleSharp, unless time is really a factor in your application. And in this case it is not.
Second: Use a web debugger like Fiddler or Charles to understand what it is that you are actually getting from the server when you make a request. You are not getting any HTML created with JavaScript or API calls; you only get the page source. That is why the tables are empty: they are generated with JavaScript.
For instance, I just used a web debugger to see that the site makes an API call to:
http://www.oddsmath.com/api/v1/dropping-odds.json/?sport_type=soccer&provider_id=7&cat_id=0&interval=60&sortBy=1&limit=30&language=en
JavaScript then uses this JSON object to create the rest of the page.
And it returns a nice JSON object that is easier to navigate than the HTML with either HAP or AngleSharp. I recommend using Newtonsoft JSON.
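A minimal sketch of consuming that endpoint with HttpClient and Newtonsoft JSON (the payload's schema isn't shown here, so the parsing is kept generic; inspect the response to pick out the fields you need):

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class Program
{
    static async Task Main()
    {
        var url = "http://www.oddsmath.com/api/v1/dropping-odds.json/"
                + "?sport_type=soccer&provider_id=7&cat_id=0&interval=60&sortBy=1&limit=30&language=en";

        using (var client = new HttpClient())
        {
            string json = await client.GetStringAsync(url);

            // Parse without committing to a fixed schema, then drill into
            // whatever properties the API actually returns.
            JToken root = JToken.Parse(json);
            Console.WriteLine(root.ToString());
        }
    }
}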
If you are adamant on using HtmlAgilityPack, then you need to combine it with Selenium, because then you can wait until the page is fully loaded before parsing the HTML.
[Edit]
Further digging:
Api-request to get all the leagues and their ids:
http://www.oddsmath.com/api/v1/menu-leagues.json/?language=en
Api-request for just the asian champions league:
http://www.oddsmath.com/api/v1/events-by-league.json/?language=en&country_code=GB&league_id=1053
An alternative solution, with Selenium and the Firefox driver.
Even though I highly recommend that you use the API and Newtonsoft JSON for your solution, I will show how it can be done with Selenium.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using OpenQA.Selenium.Firefox;
using OpenQA.Selenium;
using System.Threading;

namespace SeleniumHap {
    class Program {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            string url = "http://www.oddsmath.com/football/sweden/division-1-1195/2019-04-26/if-sylvia-vs-nykopings-bis-2858046/";
            //string url = "http://www.oddsmath.com/";

            FirefoxOptions options = new FirefoxOptions();
            //options.AddArguments("--headless");
            IWebDriver driver = new FirefoxDriver(options);
            driver.Navigate().GoToUrl(url);

            while (true) {
                doc.LoadHtml(driver.PageSource);
                HtmlNode n = doc.DocumentNode.SelectSingleNode("//table[@id='table-odds-cat-0']//*[self::th or self::td]");
                if (n != null) {
                    n = n.SelectSingleNode(".//div[@class='live-odds-loading']");
                    if (n == null) {
                        break;
                    }
                }
                Thread.Sleep(1000);
            }
            Console.WriteLine("Exited loop. Meaning the page is done loading since we could get a td. A crude method but it works");

            HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
            foreach (HtmlNode table in tables) {
                Console.WriteLine(table.GetAttributeValue("id", "No id"));
                HtmlNodeCollection tableContent = table.SelectNodes(".//*[self::th or self::td]");
                foreach (HtmlNode n in tableContent) {
                    Console.WriteLine(n.InnerHtml);
                }
                break;
            }
            Console.ReadKey();
        }
    }
}
As you can see, I use Firefox as my driver instead of Chrome. When using either, you might have to set the BrowserExecutableLocation property in the options to tell the driver where the browser's executable is.
As you can also see, I am using a while loop in a crude way to make sure that the browser fully loads the page before reading the HTML.
I'm trying to make a C# program that takes one XML file and turns it into an HTML file. The most obvious way to do so would be with an HtmlTextWriter object, yet I question needing six lines of code to write one tag, an attribute, a line of content, and a closing tag. Is there a cleaner / more efficient way to do this?
The program is using an XML file (format defined by XML Schema) to customize and populate an HTML template with data. An example is shown below:
static string aFileName;
static XmlDocument aParser;
static HtmlTextWriter HTMLIOutput;
static StringWriter HTMLIBuffer;
static StreamWriter HTMLOutIFile;
static HtmlTextWriter HTMLEOutput;
static StringWriter HTMLEBuffer;
static StreamWriter HTMLOutEFile;
HTMLIBuffer = new StringWriter();
HTMLIOutput = new HtmlTextWriter(HTMLIBuffer);
XmlElement feed = aParser.DocumentElement;
HTMLIOutput.WriteBeginTag("em");
HTMLIOutput.WriteAttribute("class", "updated");
HTMLIOutput.Write(HtmlTextWriter.TagRightChar);
HTMLIOutput.Write("Last updated: " +
feed.SelectSingleNode("updated").InnerText.Trim());
HTMLIOutput.WriteEndTag("em");
HTMLIOutput.WriteLine();
HTMLIOutput.WriteLine("<br>");
To write something such as <em class="updated">Last updated: 07/16/2018</em><br />, do I really need to have so many different lines just constructing parts of a tag?
Note: Yes, I could write the contents to the file directly, but if possible I would prefer a more intelligent way so there's less human error involved.
You can always use Obisoft.HSharp:
var Document = new HDoc(DocumentOptions.BasicHTML);
Document["html"]["body"].AddChild("div");
Document["html"]["body"]["div"].AddChild("a", new HProp("href", "/#"));
Document["html"]["body"]["div"].AddChild("table");
Document["html"]["body"]["div"]["table"].AddChildren(
new HTag("tr"),
new HTag("tr", "SomeText"),
new HTag("tr", new HTag("td")));
var Result = Document.GenerateHTML();
Console.WriteLine(Result);
or System.Xml.Linq:
var html = new XElement("html",
new XElement("head",
new XElement("title", "My Page")
),
new XElement("body",
"this is some text"
)
);
Is using something like Razor not applicable here? If you're doing a lot of HTML generation, using a view engine can make it a lot easier, and Razor was also built to be used outside of ASP.NET.
However, sometimes that's not what you need. Have you considered using the TagBuilder class, which is part of .NET (MVC)? There is also the HtmlTextWriter in System.Web.UI (for Web Forms). I would recommend one of these if you are making controls or HTML helpers.
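For instance, a minimal TagBuilder sketch (assuming a reference to System.Web.Mvc) that produces the <em> tag from the question:

using System;
using System.Web.Mvc;

class Demo
{
    static void Main()
    {
        // Build <em class="updated">Last updated: 07/16/2018</em> in three calls.
        var em = new TagBuilder("em");
        em.AddCssClass("updated");
        em.SetInnerText("Last updated: 07/16/2018");

        Console.WriteLine(em.ToString(TagRenderMode.Normal));
    }
}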
This is my suggestion:
1. Deserialize the XML into C# objects
2. Use a template engine such as RazorEngine to generate the HTML
I used RazorEngine in the past to generate email templates (in HTML format). The templates use a syntax similar to ASP.NET MVC views (.cshtml) and you can even make IntelliSense work with them! Also, templates are much easier to create and maintain compared to XSLT or TagBuilder.
Consider the following model:
public class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}
You can create a string with the HTML template, or use a file. I recommend using a file with a .cshtml extension, so you can have syntax highlighting and IntelliSense, as already mentioned.
Template text is the following:
@using RazorEngine.Templating
@using RazorDemo1
@inherits TemplateBase<Person>
<div>
    Hello <strong>@Model.FirstName @Model.LastName</strong>
</div>
Loading the template and generating the HTML:
using System;
using System.IO;
using RazorEngine;
using RazorEngine.Templating;

namespace RazorDemo1
{
    class Program
    {
        static void Main(string[] args)
        {
            string template = File.ReadAllText("./Templates/Person.cshtml");

            var person = new Person
            {
                FirstName = "Rui",
                LastName = "Jarimba"
            };

            string html = Engine.Razor.RunCompile(template, "templateKey", typeof(Person), person);
            Console.WriteLine(html);
        }
    }
}
Output:
<div>
    Hello <strong>Rui Jarimba</strong>
</div>
I'm attempting to create a method for Unity3D that will allow me to populate a UI via an XML file, i.e. rather than naming each button and label explicitly, they can carry generic names like "progress button" or "large text" and then be matched by the C# script to the verbose name in the XML file.
I have searched extensively for tutorials, examples and guides but each that I have found has been overkill for what I am trying to accomplish.
Ideally, I'd like to provide an XML file using the following structure:
<?xml version="1.0" encoding="utf-8"?>
<strings>
<string name="progressBtn">Next</string>
<string name="reverseBtn">Back</string>
<string name="largeText">This is Large Text</string>
</strings>
I know how to dynamically change text in Unity by accessing the properties of text objects, so I'm not worried about that step. What I have currently is this:
using UnityEngine;
using System.Collections;
using System.Xml;
using System.IO;

public class textParser : MonoBehaviour
{
    public TextAsset targetXMLFile;
    public GameObject uiObjectText;
    string targetString;

    // Use this for initialization
    void Start()
    {
        checkFile();                     // Check for strings file.
        checkTarget(uiObjectText.name);  // Check for the object name in the GUI object
    }

    // Update is called once per frame
    void Update()
    {
        //TODO
    }

    // Check for strings file.
    void checkFile()
    {
        if (targetXMLFile == null) // If it is null, log an error
        {
            print("Error: target text file not loaded!");
        }
        else // If something, log the file name
        {
            print(targetXMLFile.name + " Target text file loaded!");
        }
    }

    // Check for the object name in the GUI object
    void checkTarget(string target)
    {
        if (target == null) // If it is null, log an error
        {
            print("Error: Unable to extract target ui object name!");
        }
        else // If something, log the GUI object name
        {
            print("Found: " + target + " In GUI.");
        }
    }
}
Obviously very basic, but it works. I know I need to use the XML libraries to accomplish my search (the string matching I understand); getting to that step is what eludes me.
Any tutorials geared more towards this use of XML I'd love to look at, or if anyone could give me an idea of what methods I need to access to accomplish this. Personally, I'd love to understand the verbose process behind what I am trying to do, if anyone could provide a link to example code.
Thanks in advance!
You can use Xml.Serialization from .NET:
https://unitygem.wordpress.com/xml-serialisation/
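As a minimal sketch of that idea, matching the XML structure from the question (the class and field names here are just illustrative):

using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;
using UnityEngine;

[XmlRoot("strings")]
public class StringTable
{
    [XmlElement("string")]
    public List<StringEntry> Entries = new List<StringEntry>();
}

public class StringEntry
{
    [XmlAttribute("name")]
    public string Name;

    [XmlText]
    public string Value;
}

public class XmlStringLoader : MonoBehaviour
{
    public TextAsset targetXMLFile;

    void Start()
    {
        var serializer = new XmlSerializer(typeof(StringTable));
        using (var reader = new StringReader(targetXMLFile.text))
        {
            var table = (StringTable)serializer.Deserialize(reader);
            foreach (StringEntry entry in table.Entries)
            {
                // e.g. "progressBtn" -> "Next"
                Debug.Log(entry.Name + " -> " + entry.Value);
            }
        }
    }
}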
Try something like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string header = "<?xml version=\"1.0\" encoding=\"utf-8\"?><strings></strings>";
            XDocument doc = XDocument.Parse(header);
            XElement strings = (XElement)doc.FirstNode;

            List<List<string>> buttons = new List<List<string>>() {
                new List<string>() { "progressBtn", "Next" },
                new List<string>() { "reverseBtn", "Back" },
                new List<string>() { "largeText", "This is Large Text" }
            };

            foreach (List<string> button in buttons)
            {
                strings.Add(
                    new XElement("string", new object[] {
                        new XAttribute("name", button[0]),
                        button[1]
                    }));
            }

            Console.WriteLine(doc.ToString());
        }
    }
}
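The generated document matches the format from the question; doc.ToString() produces:

<strings>
  <string name="progressBtn">Next</string>
  <string name="reverseBtn">Back</string>
  <string name="largeText">This is Large Text</string>
</strings>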
I can't figure out what's going wrong. I just created a project to test HtmlAgilityPack, and here's what I've got.
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

namespace parseHabra
{
    class Program
    {
        static void Main(string[] args)
        {
            HTTP net = new HTTP(); // some http wrapper
            string result = net.MakeRequest("http://stackoverflow.com/", null);

            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(result);

            // Get all summary blocks
            HtmlNodeCollection news = doc.DocumentNode.SelectNodes("//div[@class=\"summary\"]");
            foreach (HtmlNode item in news)
            {
                string title = String.Empty;
                // Trouble is here: for each element item I get the same value
                // all the time
                title = item.SelectSingleNode("//a[@class=\"question-hyperlink\"]").InnerText.Trim();
                Console.WriteLine(title);
            }
            Console.ReadLine();
        }
    }
}
It looks like the XPath is applied not to each node I've selected but to the whole document. Any suggestions as to why that is? Thanks in advance.
I have not tried your code, but from a quick look I suspect the problem is that the // is searching from the root of the entire document and not from the current element, as I guess you are expecting.
Try putting a . before the //:
".//a[@class=\"question-hyperlink\"]"
I'd rewrite your XPath as a single query to find all the question titles, rather than finding the summaries and then the titles. Chris' answer points out the problem, which could have easily been avoided.
var web = new HtmlWeb();
var doc = web.Load("http://stackoverflow.com");
var xpath = "//div[starts-with(@id,'question-summary-')]//a[@class='question-hyperlink']";
var questionTitles = doc.DocumentNode
    .SelectNodes(xpath)
    .Select(a => a.InnerText.Trim());
This is console application code in C# for executing CAML queries on SharePoint Server 2007:
using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.SharePoint;

namespace SharePointUtils
{
    class Program
    {
        static void Main(string[] args)
        {
            string siteUrl = args[0];
            string listName = args[1];
            string viewName = args[2];

            // SPSite and SPWeb are IDisposable and should be released.
            using (SPSite site = new SPSite(siteUrl))
            using (SPWeb web = site.OpenWeb())
            {
                SPList employeesList = web.Lists[listName];
                SPQuery query = new SPQuery(employeesList.Views[viewName]);

                System.Diagnostics.Debug.WriteLine(query.ViewXml);
                Console.WriteLine(query.ViewXml);
                Console.ReadLine();
            }
        }
    }
}
How would this code change if it were executed not as a console application, but from a button click or similar user interaction within a SharePoint list view, with the results also displayed within SharePoint, e.g. in an aspx page?
And if possible, please give some tips on the aspx page creation as well.
Really, help at any level will be sincerely appreciated.
A first step might be to retrieve the results as a DataTable and bind that to an aspx DataGrid/GridView control.
To get the results as a DataTable you can use the GetDataTable method of SPListItemCollection.
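A minimal sketch of that first step in an application page code-behind (the page class, grid ID, list name, and view name are illustrative placeholders, assuming a GridView declared on the aspx page):

using System;
using System.Data;
using System.Web.UI;
using System.Web.UI.WebControls;
using Microsoft.SharePoint;

public partial class EmployeeQueryPage : Page
{
    // Declared on the aspx page, e.g.
    // <asp:GridView ID="resultsGrid" runat="server" AutoGenerateColumns="true" />
    protected GridView resultsGrid;

    protected void RunQueryButton_Click(object sender, EventArgs e)
    {
        // Inside SharePoint, the current site context is already available.
        SPWeb web = SPContext.Current.Web;

        // "Employees" and "All Items" are placeholder names; in practice
        // these would come from configuration or the page itself.
        SPList employeesList = web.Lists["Employees"];
        SPQuery query = new SPQuery(employeesList.Views["All Items"]);

        SPListItemCollection items = employeesList.GetItems(query);

        // GetDataTable flattens the items into a DataTable for binding.
        DataTable results = items.GetDataTable();
        resultsGrid.DataSource = results;
        resultsGrid.DataBind();
    }
}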