I've scraped a website with HtmlAgilityPack in C#, and now I'm trying to open every link inside it and scrape those pages with the same method.
But when I call the method below, the page is downloaded by the library as if I had AdBlock active. In fact, I can't find any tables, and the downloaded HTML says "AdBlock detected".
This is strange because I've set up a filter for the oddsmath website in my Google Chrome and I can download the master page without any problem. Has anyone faced this problem?
This is the function; the Console.WriteLine calls are just for testing, to see the full HTML code.
public void GetMatchesDetails()
{
    List<String> matchDetails = new List<string>();
    foreach (Oddsmath om in oddsmathGoodMatches)
    {
        matchDetails.Add("http://www.oddsmath.com" + om.matchUrl);
    }
    foreach (String om in matchDetails)
    {
        HtmlDocument doc = new HtmlWeb().Load(om);
        foreach (HtmlNode table in doc.DocumentNode.SelectNodes("html"))
        {
            Console.WriteLine("Found: " + table.OuterHtml);
            foreach (HtmlNode row in table.SelectNodes("tr"))
            {
                Console.WriteLine("row");
                foreach (HtmlNode cell in row.SelectNodes("th|td"))
                {
                    Console.WriteLine("cell: " + cell.InnerText);
                }
            }
        }
    }
}
EDIT
Going a little deeper, I've noticed that maybe it's not a problem with my application or something related to AdBlock; it seems connected to the website I'm trying to scrape... In fact, if you visit a page like oddsmath.com/football/international/afc-champions-league-1053/… you can see that the content loads correctly in the browser, but the tables are empty in the page source. Why? Is it JavaScript that prevents the page from loading?
First: use whichever of HAP and AngleSharp you are most comfortable with, unless speed is really a factor in your application. And in this case it is not.
Second: use a web debugger like Fiddler or Charles to understand what you are actually getting when you make a request. You are not getting any HTML created by JavaScript or API calls; you only get the page source, which is why the tables are empty. They are generated by JavaScript.
For instance, I just used a web debugger to see that the site makes an API call to:
http://www.oddsmath.com/api/v1/dropping-odds.json/?sport_type=soccer&provider_id=7&cat_id=0&interval=60&sortBy=1&limit=30&language=en
JavaScript then uses this JSON object to create the rest of the page.
The call returns a nice JSON object that is easier to navigate than HTML is with either HAP or AngleSharp. I recommend using Newtonsoft's Json.NET.
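For illustration, here is a minimal sketch of that approach using HttpClient and Json.NET. The "events" property name below is an assumption, not something I verified against the response; dump root.ToString() once to see the actual structure.
using System;
using System.Net.Http;
using Newtonsoft.Json.Linq;

class DroppingOddsExample
{
    static void Main()
    {
        const string url = "http://www.oddsmath.com/api/v1/dropping-odds.json/"
            + "?sport_type=soccer&provider_id=7&cat_id=0&interval=60&sortBy=1&limit=30&language=en";
        using (var client = new HttpClient())
        {
            string json = client.GetStringAsync(url).Result;
            // Parse into a generic JObject so no typed model is needed up front.
            JObject root = JObject.Parse(json);
            // "events" is an assumed property name; inspect the real response to confirm.
            JToken events = root["events"];
            if (events != null)
            {
                foreach (JToken ev in events)
                {
                    Console.WriteLine(ev.ToString());
                }
            }
        }
    }
}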
If you are set on using HtmlAgilityPack, then you need to combine it with Selenium, because then you can wait until the page is fully loaded before parsing the HTML.
[Edit]
Further digging:
API request to get all the leagues and their ids:
http://www.oddsmath.com/api/v1/menu-leagues.json/?language=en
API request for just the Asian Champions League:
http://www.oddsmath.com/api/v1/events-by-league.json/?language=en&country_code=GB&league_id=1053
Alternative solution: Selenium with the Firefox driver.
Even though I highly recommend that you use the API together with Newtonsoft's Json.NET, I will show how it can be done with Selenium.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using OpenQA.Selenium.Firefox;
using OpenQA.Selenium;
using System.Threading;

namespace SeleniumHap {
    class Program {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            string url = "http://www.oddsmath.com/football/sweden/division-1-1195/2019-04-26/if-sylvia-vs-nykopings-bis-2858046/";
            //string url = "http://www.oddsmath.com/";

            FirefoxOptions options = new FirefoxOptions();
            //options.AddArguments("--headless");
            IWebDriver driver = new FirefoxDriver(options);
            driver.Navigate().GoToUrl(url);

            // Poll until the odds table has a th/td cell that no longer
            // contains the "live-odds-loading" spinner.
            while (true) {
                doc.LoadHtml(driver.PageSource);
                HtmlNode n = doc.DocumentNode.SelectSingleNode("//table[@id='table-odds-cat-0']//*[self::th or self::td]");
                if (n != null) {
                    n = n.SelectSingleNode(".//div[@class='live-odds-loading']");
                    if (n == null) {
                        break;
                    }
                }
                Thread.Sleep(1000);
            }
            Console.WriteLine("Exited loop. Meaning the page is done loading since we could get a td. A crude method but it works");

            HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
            foreach (HtmlNode table in tables) {
                Console.WriteLine(table.GetAttributeValue("id", "No id"));
                HtmlNodeCollection tableContent = table.SelectNodes(".//*[self::th or self::td]");
                foreach (HtmlNode n in tableContent) {
                    Console.WriteLine(n.InnerHtml);
                }
                break;
            }
            Console.ReadKey();
        }
    }
}
As you can see, I use Firefox as my driver instead of Chrome. With either one you might have to edit the options, setting the 'BrowserExecutableLocation' property to tell the driver where the browser's executable is.
As you can also see, I am using a while loop in a crude way to make sure that the browser fully loads the page before I continue reading the HTML.
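For reference, pointing the driver at a specific browser binary looks like this (a sketch; the path is machine-specific):
FirefoxOptions options = new FirefoxOptions();
// Only needed if the driver cannot locate the browser on its own.
options.BrowserExecutableLocation = @"C:\Program Files\Mozilla Firefox\firefox.exe";
IWebDriver driver = new FirefoxDriver(options);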
Related
I've been following tutorials on how to scrape information using HtmlAgilityPack; here is an example:
using System;
using System.Linq;
using System.Net;

namespace web_scraping_test
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
            HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.yellowpages.com/search?search_terms=Software&geo_location_terms=Sydney%2C+ND");
            var names = doc.DocumentNode.SelectNodes("//a[@class='business-name']").ToList();
            foreach (var item in names)
            {
                Console.WriteLine(item.InnerText);
            }
        }
    }
}
Getting the data here was easy because there's a common class name, and it's simple to get to.
I'm trying to use the same approach to scrape information from this site, https://osu.ppy.sh/beatmapsets/354163#osu/780200, but I have no idea about the correct markup to get 'Stitches' / 'Shawn Mendes' and the values given in this diagram: [diagram]
For 'Shawn Mendes' the markup is '<a class="beatmapset-header__details-text beatmapset-header__details-text--artist" href="https://osu.ppy.sh/beatmapsets?q=Shawn%20Mendes">Shawn Mendes</a>', but I'm not sure how to work this into the code. I've replaced the URL and changed the class name, but the path to this text seems a lot more complicated on this site. Any advice would be appreciated, thanks!
All of the details you're looking for appear to be in a JSON object in the markup. There is a script block with the id "json-beatmapset"; if you scrape the content of that and parse the JSON it contains, it should be smooth sailing after that.
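A minimal sketch of that idea, combining HtmlAgilityPack with Json.NET (the "json-beatmapset" id comes from the page, but the "artist" and "title" keys below are assumptions, so print the parsed object once to confirm them):
using System;
using HtmlAgilityPack;
using Newtonsoft.Json.Linq;

class BeatmapExample
{
    static void Main()
    {
        HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load("https://osu.ppy.sh/beatmapsets/354163");
        // The page embeds its data as JSON inside a script block.
        HtmlNode script = doc.DocumentNode.SelectSingleNode("//script[@id='json-beatmapset']");
        if (script == null)
        {
            Console.WriteLine("json-beatmapset script block not found.");
            return;
        }
        JObject beatmapset = JObject.Parse(script.InnerText);
        // Assumed keys; dump beatmapset.ToString() to see what is really there.
        Console.WriteLine(beatmapset["artist"]);
        Console.WriteLine(beatmapset["title"]);
    }
}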
I'm using the Selenium WebDriver NuGet package for C#. As part of my tests, I'm checking the text of a paragraph. However, the HTML for the paragraph looks like this:
<p>This is <strong>bold</strong>.</p>
...and if I have an IWebElement representing the p tag, then the .Text property returns
This is .
In other words, it only returns the text from the p tag, and not from the embedded strong tag.
There doesn't seem to be any method or property on IWebElement that would allow me to get the full text of the p tag and its children.
So... how can it be done?
I'm out of the office right now, but my colleague informs me that the problem can be resolved by casting the IWebElement returned by GetElementById to RemoteWebElement and then calling the Text property on that.
This is very surprising - I would have thought that Text would be a virtual property, and that the behaviour would be defined by the run-time type, not the compile-time type.
UPDATE
It appears that my colleague was mistaken. Casting to RemoteWebElement did not fix the problem. Rather, it seems that breaking in the debugger and inspecting the Text property caused it to return the correct value.
I've now tried to reproduce this problem in a minimal program (see below), and (surprise!) I can't reproduce it. The Text property is behaving correctly. I'll continue to investigate what's different about my real setup.
namespace SeleniumTest
{
    using System;
    using System.Linq;
    using OpenQA.Selenium.IE;
    using OpenQA.Selenium.Support.UI;

    public class Program
    {
        public static void Main(string[] args)
        {
            const string ExamplePageUrl = "http://www.nngroup.com/consulting/ux-research-usability-testing/";
            var webDriver = new InternetExplorerDriver();
            webDriver.Navigate().GoToUrl(ExamplePageUrl);
            var wait = new WebDriverWait(webDriver, TimeSpan.FromSeconds(10));
            wait.Until(w => w.Title == "Nielsen Norman Group: UX Research, Training, and Consulting");
            var paras = webDriver.FindElementsByTagName("p");
            var para = paras.FirstOrDefault(p => p.Text.Contains("We test your website or application"));
            if (para == null)
            {
                Console.WriteLine("Dang. Looks like the website changed.");
            }
            else
            {
                Console.WriteLine(para.Text);
            }
            Console.ReadLine();
        }
    }
}
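For what it's worth, if you ever do hit a driver where .Text drops child text, one workaround (a sketch, not something this problem turned out to need; By requires using OpenQA.Selenium) is to read the DOM properties directly through GetAttribute:
IWebElement para = webDriver.FindElement(By.TagName("p"));
string visibleText = para.Text;                       // rendered text, as the driver computes it
string domText = para.GetAttribute("textContent");    // raw DOM text, child elements included
string innerHtml = para.GetAttribute("innerHTML");    // e.g. "This is <strong>bold</strong>."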
I'm in the middle of creating a utility console app to help me register for certain classes at my university. So far I've made it download the website's content and frequently check for specific changes, which tells me when a given course is full or open for registration. Like this:
// Download the course page and pull out the "group occupancy" counter
// (the Polish markers mean "Group occupancy:" and "(registered/limit)").
WebRequest request2 = WebRequest.Create("https://usosweb.umk.pl/kontroler.php?_action=katalog2/przedmioty/pokazPrzedmiot&prz_kod=0600-OG-ChH");
request2.Method = "GET";
WebResponse response2 = request2.GetResponse();
Stream stream2 = response2.GetResponseStream();
StreamReader reader2 = new StreamReader(stream2);
string content_2 = reader2.ReadToEnd();
string drugi = getBetween(content_2, @"Stan zapełnienia grup:</b>
<b>", "</b> (zarejestrowanych/limit)");
reader2.Close();
response2.Close();

// If the counter changed since the last check, alert and open the page.
if (drugi != pierwszy)
{
    Console.WriteLine("Rejestracja!");
    Console.Beep(3200, 900);
    System.Diagnostics.Process.Start("https://usosweb.umk.pl/kontroler.php?_action=katalog2/przedmioty/pokazPrzedmiot&prz_kod=0600-OG-ChH");
    pierwszy = drugi;
}
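The snippet relies on a getBetween helper that isn't shown. A typical implementation (a sketch, assuming the usual "substring between two markers" semantics) would be:
// Returns the text between the first occurrence of `start` and the
// following occurrence of `end`, or "" if either marker is missing.
static string getBetween(string source, string start, string end)
{
    int from = source.IndexOf(start, StringComparison.Ordinal);
    if (from < 0) return "";
    from += start.Length;
    int to = source.IndexOf(end, from, StringComparison.Ordinal);
    return to < 0 ? "" : source.Substring(from, to - from);
}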
The problem is that it still requires my full attention, because all it does is open the website with the registration button on it; my goal is to make it actually click that button automatically once a slot opens.
A few things to note:
- I have to be logged in on that website in order to be able to register for the course.
- http://i.imgur.com/DtBCG3Q.jpg <- this is how the button is coded. The chain_ function is named differently on every single refresh.
- http://i.imgur.com/tGX5kmy.jpg <- this is what the registration panel looks like. Ideally I want the website to open in a default browser (or somewhere with cache, so I am already logged in) and automatically press that button, as it doesn't require additional confirmation.
Links to one of the websites at my university are included in the code above, so you can have a further look at how the button is coded and how this case could be solved.
After all, is this even possible? Can I code my way through it? I'm using C#, but additional snippets of code in other languages can be put in if that makes it easier or possible.
I think that for this kind of task automation, Selenium WebDriver is the best tool.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;
using OpenQA.Selenium.Chrome;

namespace WEBDRIVER
{
    class Program
    {
        static void Main(string[] args)
        {
            IWebDriver driver = new ChromeDriver();
            driver.Navigate().GoToUrl("http://www.google.com/");
            IWebElement query = driver.FindElement(By.Name("q"));
            query.SendKeys("banana");
            query.Submit();
            WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
            wait.Until((d) => { return d.Title.ToLower().StartsWith("banana"); });
            System.Console.WriteLine("Page title is: " + driver.Title);
            driver.Quit();
        }
    }
}
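Applied to the registration page, the same pattern would be: log in, wait for the button to appear, then click it. Since the generated chain_... handler name changes on every refresh, locate the button by something stable such as its visible label instead. A sketch (the XPath is a guess, since I can't see the logged-in markup):
WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(60));
// Assumed locator: match the submit button by its label rather than
// its randomly named onclick handler.
IWebElement registerButton = wait.Until(d =>
    d.FindElement(By.XPath("//input[@type='submit' and contains(@value, 'Rejestruj')]")));
registerButton.Click();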
I am using GetSafeHtmlFragment in my website, and I found that all tags except <p> and <a> are removed.
I researched around and found that there is no resolution for it from Microsoft.
Is there anything that supersedes it, or is there any other solution?
Thanks.
It's amazing that in version 4.2.1 Microsoft terribly overcompensated for a security leak in the 4.2 XSS library, and still hasn't updated it a year later. The GetSafeHtmlFragment method should have been renamed StripHtml, as I read someone comment somewhere.
I ended up using the HtmlSanitizer library suggested in this related SO issue. I liked that it was available as a package through NuGet.
This library basically implements a variation of the whitelist approach that the now-accepted answer uses; however, it is based on CsQuery instead of the HTML Agility Pack. The package also gives some additional options, like being able to keep style information (e.g. HTML attributes). Using this library, the code in my project ended up something like the snippet below, which is at least a lot less code than the accepted answer :).
using Html;
...
var sanitizer = new HtmlSanitizer();
sanitizer.AllowedTags = new List<string> { "p", "ul", "li", "ol", "br" };
string sanitizedHtml = sanitizer.Sanitize(htmlString);
An alternative solution would be to use the Html Agility Pack in conjunction with your own tag whitelist:
using System;
using System.IO;
using System.Text;
using System.Linq;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var whiteList = new[]
        {
            "#comment", "html", "head",
            "title", "body", "img", "p",
            "a"
        };
        var html = File.ReadAllText("input.html");
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var nodesToRemove = new List<HtmlAgilityPack.HtmlNode>();
        var e = doc
            .CreateNavigator()
            .SelectDescendants(System.Xml.XPath.XPathNodeType.All, false)
            .GetEnumerator();
        while (e.MoveNext())
        {
            var node =
                ((HtmlAgilityPack.HtmlNodeNavigator)e.Current)
                .CurrentNode;
            if (!whiteList.Contains(node.Name))
            {
                nodesToRemove.Add(node);
            }
        }
        nodesToRemove.ForEach(node => node.Remove());
        var sb = new StringBuilder();
        using (var w = new StringWriter(sb))
        {
            doc.Save(w);
        }
        Console.WriteLine(sb.ToString());
    }
}
How can I simulate the functions/actions of a proxy server without using components like HttpListener or TcpListener? How can I generate them from within my C# application?
I've been able to get as far as getting actual data streamed back to the WebBrowser element in my C# application, but upon viewing the results, I get errors. The reason is that I'm viewing the literal string, and there are JS/CSS components within the resulting HTML stream that reference objects via relative URIs. Obviously, my solution thinks they're local and, as such, can't resolve them.
I'm missing the proxy-like functions that would just hand the stream back to my mock browser so it displays properly. However, looking at sample proxy server code built in C#, it's all built as servers using listeners. I'd like something I can instantiate locally without the need to create a listening interface.
Now, you may be wondering why I'm trying to do this? Well, there are a couple of reasons:
- To be able to inject headers ad hoc so I can test internal web servers.
- To run as a headless (no GUI) component that can take either HTTP or HTTPS streams from other .NET components and inject headers from, yet, other .NET components.
- Some other back-end stuff that I think this might help with, but I won't know until I have it in place.
Here's what I have so far:
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using HtmlAgilityPack;
using System.Net;

namespace WindowsFormsApplication1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            WebClient client = new WebClient();
            var baseUrl = new Uri(textBox1.Text);
            client.Headers.Add("Token1", textBox2.Text);
            client.Headers.Add("Token2", textBox3.Text);
            byte[] requestHTML = client.DownloadData(textBox1.Text);
            string sourceHTML = new UTF8Encoding().GetString(requestHTML);
            HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
            htmlDoc.LoadHtml(sourceHTML);

            // Rewrite every relative href to an absolute URI against the base URL.
            //"//*[@background or @lowsrc or @src or @href]"
            foreach (HtmlNode link in htmlDoc.DocumentNode.SelectNodes("//*[@href]"))
            {
                if (!string.IsNullOrEmpty(link.Attributes["href"].Value))
                {
                    HtmlAttribute att = link.Attributes["href"];
                    Console.WriteLine("Before: " + att.Value);
                    Console.WriteLine(new Uri(baseUrl, att.Value));
                    link.Attributes["href"].Value = new Uri(baseUrl, att.Value).ToString();
                    Console.WriteLine("After: " + link.Attributes["href"].Value);
                    //att.Value = this.AbsoluteUrlByRelative(att.Value);
                }
            }

            // Same for src attributes, with a special case for the /WS prefix.
            foreach (HtmlNode link2 in htmlDoc.DocumentNode.SelectNodes("//*[@src]"))
            {
                if (!string.IsNullOrEmpty(link2.Attributes["src"].Value))
                {
                    HtmlAttribute att = link2.Attributes["src"];
                    Console.WriteLine("Before: " + att.Value);
                    Console.WriteLine(new Uri(baseUrl, att.Value));
                    if (!att.Value.Contains("/WS"))
                    {
                        Console.WriteLine("HIT ME!");
                        var output = "/WS/" + att.Value;
                        link2.Attributes["src"].Value = new Uri(baseUrl, output).ToString();
                        Console.WriteLine("After HIT: " + link2.Attributes["src"].Value);
                    }
                    else
                    {
                        link2.Attributes["src"].Value = new Uri(baseUrl, att.Value).ToString();
                        Console.WriteLine("After: " + link2.Attributes["src"].Value);
                    }
                    //att.Value = this.AbsoluteUrlByRelative(att.Value);
                }
            }

            Console.WriteLine(htmlDoc.DocumentNode.OuterHtml);
            Console.WriteLine("+========================+");
            webBrowser1.DocumentText = htmlDoc.DocumentNode.OuterHtml;
        }
    }
}
Again, this is just prototype code, so forgive the wacky spacing and commenting. In the end it will be more formal. Right now, this monkey is killing my back.
How about using something like NMock or similar? It would mean having to introduce interfaces so that the mocks can be injected, but that still beats doing it almost any other way, IMHO...
From the NMock site:
NMock is a dynamic mock object library for .NET. Mock objects make it easier to test single components (often single classes) without relying on real implementations of all of the other components. This means we can test just one class, rather than a whole tree of objects, and can pinpoint bugs much more clearly. Mock objects are often used during Test Driven Development.
You would mock the proxy server more or less like this:
var mocks = new Mockery();
var mockProxyServer = mocks.NewMock<IMyProxyServer>();
That's all you need to do. As you can see, it's interface-dependent. But usually all I've needed to do is Refactor -> Extract Interface on the relevant class in VS.
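The extracted interface itself can stay minimal; something like this (the member name and signature are hypothetical, chosen to match the test below):
public interface IMyProxyServer
{
    // One stand-in operation for whatever the real proxy component does.
    string ProxyFunctionA(string request);
}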
Setting up the simulation is usually done within the context of the unit test, like:
public class TransferFundsPresenterTest
{
    private Mockery mocks;
    private IMyProxyServer mockProxyServer;

    [SetUp]
    public void SetUp()
    {
        mocks = new Mockery();
        mockProxyServer = mocks.NewMock<IMyProxyServer>();
    }

    [Test]
    public void TestProxyFunction()
    {
        Expect.Once.On(mockProxyServer).
            Method("ProxyFunctionA").
            With("1234"). // <-- simulate the input params here
            Will(Return.Value("Test")); // <-- simulate the output from server here
    }
}
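Typically a [TearDown] method would then call mocks.VerifyAllExpectationsHaveBeenMet(); so the test fails if an expected call was never made.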
This is just a basic example; you can do a lot more, as it's a very flexible library.
You really should take a look at the NMock site; it's pretty easy to get fully up to speed with the library.
http://www.nmock.org/index.html