Find Multiple Tables using HTML Agility Pack - c#

I am trying to find the second table, "Team and Opponent Stats", on the following page:
https://www.basketball-reference.com/teams/BOS/2017.html
But my code only shows the first table. I've tried all kinds of XPath combinations, e.g.
"//table[@id='DataTables_Table_0']/tr/td", but nothing seems to work.
Here is my code:
var url = "https://www.basketball-reference.com/teams/BOS/2017.html";
var web = new HtmlWeb();
var doc = web.Load(url);
var table1 = doc.DocumentNode
    .Descendants("tr")
    .Select(n => n.Elements("td").Select(p => p.InnerText).ToArray());
foreach (string[] s in table1)
{
    foreach (string str in s)
    {
        Console.WriteLine(str.ToString());
    }
    //Console.WriteLine(s);
}
foreach (var cell in doc.DocumentNode.SelectNodes("//table[@id='DataTables_Table_0']/tr/td"))
{
    Console.WriteLine(cell.InnerText);
}
Here is my modified code:
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes("//table[@id=\"team_and_opponent\"]//tbody"))
{
    //looping on each row, get col1 and col2 of each row
    HtmlNodeCollection tds = tr.SelectNodes("td");
    for (int i = 0; i < tds.Count; i++)
    {
        Console.WriteLine(tds[i].InnerText);
    }
}
Here is the html code for the section of the website that I want to scrape.
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team_and_opponent">
<table class="suppress_all stats_table" id="team_and_opponent" data-cols-to-freeze="1"><caption>Team and Opponent Stats Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
<tr>
<th aria-label=" " data-stat="player" scope="col" class=" poptip sort_default_asc center"> </th>
<th aria-label="Games" data-stat="g" scope="col" class=" poptip sort_default_asc center" data-tip="Games">G</th>
And here is the latest Agility Pack code I'm using to get the right table.
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//*[@id=\"team_and_opponent\"]"))
{
    string tempStr = table.InnerText;
    foreach (HtmlNode nodecol in table.SelectNodes("//tr")) // /html/body/div[1]/div[2]/div[2]/div/div/div[3]/table[2]/tbody[2]
    {
        foreach (HtmlNode cell in nodecol.SelectNodes("th|td"))
        {
            Console.WriteLine("cell: " + cell.InnerHtml.ToString());
        }
    }
}
I'm still getting a NullReference error message.

That page is dynamic: client-side JavaScript manipulates it, so the content you download from the server and see in HtmlAgilityPack does not match what you ultimately see in a browser. The table actually comes back from the server inside an HTML comment. Fortunately, the comment contains the full markup for that table, so all you need to do is select the comment, strip the comment delimiters from its text, parse the remainder as HTML, and then select as usual.
So if you wanted to load this into a data table for instance, you could do this:
using System.Data;
using System.Linq;
using HtmlAgilityPack;
var url = "https://www.basketball-reference.com/teams/BOS/2017.html";
var web = new HtmlWeb();
var doc = web.Load(url);
// The table is delivered inside an HTML comment under this wrapper div.
var tableComment = doc.DocumentNode
    .SelectSingleNode("//div[@id='all_team_and_opponent']/comment()");
// [4..^3] drops the leading "<!--" and the trailing "-->" (range syntax needs C# 8 or later).
var table = HtmlNode.CreateNode(tableComment.OuterHtml[4..^3])
    .SelectSingleNode("//table[@id='team_and_opponent']");
var dataTable = ToDataTable(table);
// Builds one string column per header cell, then one row per body row.
DataTable ToDataTable(HtmlNode node)
{
    var dt = new DataTable();
    dt.BeginInit();
    foreach (var col in node.SelectNodes("thead/tr/th"))
        dt.Columns.Add(col.GetAttributeValue("aria-label", ""), typeof(string));
    dt.EndInit();
    dt.BeginLoadData();
    foreach (var row in node.SelectNodes("tbody/tr"))
        dt.Rows.Add(row.SelectNodes("th|td").Select(t => t.InnerText).ToArray());
    dt.EndLoadData();
    return dt;
}

Check the ID of the second table you are looking for. IDs are meant to be unique within the DOM, so if the first table is called "DataTables_Table_0", the table you're trying to retrieve might have an ID of "DataTables_Table_1" or something similar. Look at the page's source.

It appears that the table is originally loaded as a comment and is then made visible using JavaScript.
You should use something like SelectSingleNode with the comment's XPath (//*[@id="all_team_and_opponent"]/comment()) and read the resulting node's InnerHtml. Then you just need to turn it into a visible table by removing the comment delimiters.
I made a very simple version of what you can do and uploaded it as a Gist so you can simply check my solution and integrate it into your program or test it on dotnetfiddle.net.
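Independently of the Gist, the comment-unwrapping step can be sketched as follows. This is only an illustration under stated assumptions: it uses System.Xml.Linq from the base class library as a stand-in for HTML Agility Pack so that it runs without extra packages, and the Sample markup, the class name CommentedTableDemo and the method ReadCells are all hypothetical. Real pages are rarely well-formed XML, so the same steps would normally be done with HTML Agility Pack.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CommentedTableDemo
{
    // Hypothetical, well-formed stand-in for what the server sends:
    // the table markup sits inside a comment under a wrapper div.
    public const string Sample =
        "<div id=\"all_team_and_opponent\">" +
        "<!-- <table id=\"team_and_opponent\"><tbody>" +
        "<tr><th>Team</th><td>82</td></tr>" +
        "</tbody></table> -->" +
        "</div>";

    // Returns the text of every th/td cell of the table hidden in the comment.
    public static string[] ReadCells(string wrapperMarkup)
    {
        XElement wrapper = XElement.Parse(wrapperMarkup);

        // Step 1: select the comment node that is a child of the wrapper.
        XComment comment = wrapper.Nodes().OfType<XComment>().FirstOrDefault();
        if (comment == null)
            return Array.Empty<string>();

        // Step 2: XComment.Value is the comment body without its delimiters,
        // so it can be parsed as markup in its own right.
        XElement table = XElement.Parse(comment.Value.Trim());

        // Step 3: select as usual.
        return table.Descendants("tr")
            .SelectMany(tr => tr.Elements().Where(e => e.Name == "th" || e.Name == "td"))
            .Select(e => e.Value)
            .ToArray();
    }
}
```

Calling CommentedTableDemo.ReadCells(CommentedTableDemo.Sample) yields the two cell texts of the sample row. Returning an empty array when no comment is found avoids the NullReference error that shows up when a selection comes back empty.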
However, if you need to run any JavaScript, you can use any of the following:
WebBrowser Class
Should be fairly easy for extracting text when combined with HTML Agility Pack, although it might be trickier for images or other element types. Overall it provides decent performance.
Javascript.Net
It allows you to execute scripts using Chrome's V8 JavaScript engine. You'll just have to find out which files change the content.
Selenium
You can use Selenium plus a WebDriver for your preferred browser (Chrome, Firefox, PhantomJS). It is somewhat slow but very flexible. This is probably overkill, so I recommend any of the above options.

Related

Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
    //Find college
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(catalog);
    var root = doc.DocumentNode;
    var htmlNodes = root.DescendantsAndSelf();
    // Search through fetched html nodes for relevant information
    int counter = 0;
    foreach (HtmlNode node in htmlNodes) {
        string linkName = node.InnerText;
        if (linkName == colleges[college] && counter == 0)
        {
            counter++;
            continue;
        }
        else if(linkName == colleges[college] && counter == 1)
        {
            string targetURL = node.Attributes["href"].Value; //"found it!"; //
            return targetURL;
        }/* */
    }
    return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210
I find XPath easier than writing a lot of code:
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
    .Attributes["href"].Value;
If you have two links with the same text, select the second one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
    .Attributes["href"].Value;
The LINQ version:
var links = doc.DocumentNode.Descendants("a")
    .Where(a => a.InnerText == "College of Science")
    .Select(a => a.Attributes["href"].Value)
    .ToList();
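As for why the original loop throws: one likely cause is that DescendantsAndSelf() also visits text nodes and wrapper elements whose InnerText equals the link text, and those have no href attribute, so Attributes["href"] is null. Restricting the search to a elements, as the XPath above does, sidesteps that. The sketch below shows the positional form with a null-safe read. It is an assumption-laden illustration: it uses System.Xml.XmlDocument instead of HTML Agility Pack so it runs on its own, and the Sample markup, SecondLinkDemo and FindHref are hypothetical names.

```csharp
using System.Xml;

public static class SecondLinkDemo
{
    // Hypothetical page: the same link text appears in the footer first,
    // then in the navigation bar.
    public const string Sample =
        "<body>" +
        "<div id=\"footer\"><a href=\"/footer-link\">College of Science</a></div>" +
        "<div id=\"nav\"><a href=\"/content.php?catoid=10&amp;navoid=1210\">College of Science</a></div>" +
        "</body>";

    // Returns the href of the n-th <a> (1-based, document order) whose text
    // equals 'text', or null when there is no such element.
    // Note: 'text' must not contain an apostrophe, since it is embedded
    // in a single-quoted XPath literal.
    public static string FindHref(string markup, string text, int position)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);

        // The parentheses matter: (//a[...])[2] is the second match in the
        // whole document, whereas //a[...][2] would be the second matching
        // <a> within a single parent.
        string xpath = "(//a[text()='" + text + "'])[" + position + "]";
        var link = doc.SelectSingleNode(xpath) as XmlElement;

        return link?.GetAttribute("href");
    }
}
```

With the sample above, position 2 returns the navigation link and position 3 returns null instead of throwing.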

HTML Agility Pack cells merged

I'm trying to pull a table off of a website using the HTML Agility Pack. I'm having a problem extracting the column data. Each row should have 6 columns. However when I read the cells it's merging all column data into one result.
I'm getting this:
Vintage Buff Banner665c12425
Instead of this:
Vintage Buff Banner
665c
1
24
Blank
25
Code I'm using is below:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.tf2wh.com/backpack?bp=x44rUEmREP-OCT9Kp-9w6n3GOJQJpf43YQD_dp98AvY");
var xpath = "/html/body/div[@class='page']/div[@class='main']/div[@class='specialtrade']/table[@class='data']/tbody/tr[@class='normal']";
var rows = doc.DocumentNode.SelectNodes(xpath);
foreach (HtmlNode row in rows)
{
    HtmlNodeCollection cells = row.SelectNodes("th|td");
    foreach (HtmlNode cell in cells)
    {
        Console.WriteLine("cell: " + cell.InnerText);
    }
}
I figured it out - it was bad HTML. I ran it through Tidy.NET before HTML Agility Pack, and I'm getting the results I want.

Retrieving value of element using HTMLAgility pack

I am using HTML Agility Pack to parse HTML, and then using XPath to retrieve a table column with a specific class.
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("www.url.com");
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("(//td[@class='titleColumn'])[2]"))
{
    Response.Write(row.InnerHtml + "<br />");
}
I retrieve the data, and row.InnerHtml looks like this:
<a>Title</a> <span>Year</span><br />
I want to save the values of the a and span elements in separate string variables. Please help.
Your XPath expression selects the second <td> that has the class titleColumn. According to the node's inner HTML, this <td> node has two child elements: <a> and <span>. So you can find these nodes and put their inner text (or inner HTML) into your string variables. This:
foreach (var row in doc.DocumentNode.SelectNodes("(//td[@class='titleColumn'])[2]"))
{
    var a = row.SelectSingleNode("a");
    var span = row.SelectSingleNode("span");
    Console.WriteLine(a.InnerText);
    Console.WriteLine(span.InnerText);
}
will output:
Title
Year

C# parse html with xpath

I'm trying to parse stock exchange information out of an HTML document with a simple piece of C#. The problem is that I can't get my head around the syntax: the tr class="LomakeTaustaVari" rows get parsed out, but how do I get the second kind of row, whose tr has no class?
Here's a piece of the HTML; it repeats itself with different values.
<tr class="LomakeTaustaVari">
<td><div class="Ensimmainen">12:09</div></td>
<td><div>MSI</div></td>
<td><div>POH</div></td>
<td><div>42</div></td>
<td><div>64,50</div></td>
</tr>
<tr>
<td><div class="Ensimmainen">12:09</div></td>
<td><div>SRE</div></td>
<td><div>POH</div></td>
<td><div>156</div></td>
<td><div>64,50</div></td>
</tr>
My C# code:
{
    HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = web.Load ("https://www.op.fi/op/henkiloasiakkaat/saastot-ja-sijoitukset/kurssit-ja-markkinat/markkinat?sivu=alltrades.html&sym=KNEBV.HSE&from=10:00&to=19:00&id=32453");
    foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr[@class='LomakeTaustaVari']"))
    {
        Console.WriteLine(row.InnerText);
    }
    Console.ReadKey();
}
Try the following XPath, //tr[preceding-sibling::tr[@class='LomakeTaustaVari']]:
var nodes = doc.DocumentNode.SelectNodes("//tr[preceding-sibling::tr[@class='LomakeTaustaVari']]");
It selects tr nodes that have a preceding sibling tr with the class LomakeTaustaVari.
Just FYI: if no nodes are found, the SelectNodes method returns null.
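To see what that expression matches when the row pair repeats, here is a small sketch. It is not part of the answer above: it uses System.Xml.XmlDocument rather than HTML Agility Pack so that it is self-contained, the Sample rows are a hypothetical, shortened version of the question's markup (the 12:10 rows are invented), and the not(@class) predicate is an alternative introduced here for comparison.

```csharp
using System.Xml;

public static class RowSelectionDemo
{
    // Hypothetical sample: the classed/unclassed row pair, repeated twice.
    public const string Sample =
        "<table>" +
        "<tr class=\"LomakeTaustaVari\"><td>12:09</td><td>MSI</td></tr>" +
        "<tr><td>12:09</td><td>SRE</td></tr>" +
        "<tr class=\"LomakeTaustaVari\"><td>12:10</td><td>MSI</td></tr>" +
        "<tr><td>12:10</td><td>SRE</td></tr>" +
        "</table>";

    // Counts the nodes an XPath expression selects in the given markup.
    public static int Count(string markup, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(markup);
        // XmlDocument.SelectNodes returns an empty list, not null,
        // when nothing matches.
        return doc.SelectNodes(xpath).Count;
    }
}
```

On this sample, "//tr" selects all four rows, the preceding-sibling expression selects three (every row after the first classed one, including the second classed row), and "//tr[not(@class)]" selects only the two rows without a class attribute.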
If you manage to get a reference to the <tr class="LomakeTaustaVari"> element, I see two possible solutions.
You can navigate to the parent and then find all its <tr> children:
lomakeTaustaVariElement.Parent.SelectNodes("tr"); // iterate over these if needed
You can also use NextSibling to get the next <tr>:
var trWithoutClass = lomakeTaustaVariElement.NextSibling;
Please note that with the second alternative you may run into issues, because whitespace present in the HTML may be interpreted as a distinct text node.
To overcome this, you can keep calling NextSibling until you encounter a tr element.
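The sibling walk can be sketched like this. As an assumption, the example uses System.Xml.XmlDocument with PreserveWhitespace enabled to reproduce the whitespace-node problem without external packages; SiblingDemo, NextElementSibling and the Sample markup are hypothetical. With HTML Agility Pack the loop would have the same shape, using HtmlNode.NextSibling and HtmlNode.Name.

```csharp
using System.Xml;

public static class SiblingDemo
{
    // Hypothetical sample with line breaks and indentation between rows.
    public const string Sample =
        "<table>\n" +
        "  <tr class=\"LomakeTaustaVari\"><td>MSI</td></tr>\n" +
        "  <tr><td>SRE</td></tr>\n" +
        "</table>";

    // Walks forward from 'start' and returns the first following sibling
    // that is an element with the given name, or null if there is none.
    public static XmlElement NextElementSibling(XmlNode start, string name)
    {
        for (XmlNode n = start.NextSibling; n != null; n = n.NextSibling)
        {
            if (n.NodeType == XmlNodeType.Element && n.Name == name)
                return (XmlElement)n;
        }
        return null;
    }

    // Returns the text of the row that follows the classed row.
    public static string TextOfRowAfterClassedRow()
    {
        // Keep whitespace nodes so that NextSibling is not the next row.
        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.LoadXml(Sample);

        XmlNode classed = doc.SelectSingleNode("//tr[@class='LomakeTaustaVari']");
        XmlElement next = NextElementSibling(classed, "tr");
        return next?.InnerText;
    }
}
```

A plain loop is used instead of recursion, which keeps the helper short and avoids deep call stacks when many non-element nodes sit between rows.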
This will iterate over all tr nodes in the document. You will probably also need to be more specific about the starting node, so that you only select the rows you are interested in.
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr"))
{
    Console.WriteLine(row.InnerText);
}
Probably I don't understand something, but the simplest XPath for any tr element selection should do the work:
doc.DocumentNode.SelectNodes("//tr")
Otherwise, in case you would like to select elements with specific class attributes only, it could be:
doc.DocumentNode.SelectNodes("//tr[@class = 'someClass1' or @class = 'someClass2']")
If you do not want to load the page and would rather use an existing HTML string, e.g. from a WebBrowser element, you can use the following example:
var web = new HtmlAgilityPack.HtmlDocument();
web.LoadHtml(webBrowser1.Document.Body.Parent.OuterHtml);
var q = web.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]"); // XPath: /html/body/div[2]/div/div[1]

Get Links in class with html agility pack

There are a bunch of tr's with the class alt. I want to get all the links (or the first or last), yet I can't figure out how with HTML Agility Pack.
I tried variants of a, but I only get all the links or none. It doesn't seem to get only the ones in the node, which makes no sense since I am writing n.SelectNodes.
html.LoadHtml(page);
var nS = html.DocumentNode.SelectNodes("//tr[@class='alt']");
foreach (var n in nS)
{
    var aS = n.SelectNodes("a");
    ...
}
You can use LINQ:
var links = html.DocumentNode
    .Descendants("tr")
    .Where(tr => tr.GetAttributeValue("class", "").Contains("alt"))
    .SelectMany(tr => tr.Descendants("a"))
    .ToArray();
Note that this will also match <tr class="Malto">; you may want to replace the Contains call with a regex.
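One way to avoid the substring problem is sketched below. It swaps the suggested regex for a whitespace split and an exact token comparison, which is a different technique chosen here for readability; ClassTokenDemo and HasClass are hypothetical names, and the helper uses only the base class library.

```csharp
using System;
using System.Linq;

public static class ClassTokenDemo
{
    // Returns true when 'token' is one of the whitespace-separated class
    // names in 'classAttribute'. The comparison is exact and case-sensitive,
    // so "alt" does not match "Malto".
    public static bool HasClass(string classAttribute, string token)
    {
        if (string.IsNullOrWhiteSpace(classAttribute))
            return false;

        return classAttribute
            // A null separator array means "split on any whitespace".
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Contains(token, StringComparer.Ordinal);
    }
}
```

In the LINQ query above, the Where clause could then read tr => ClassTokenDemo.HasClass(tr.GetAttributeValue("class", ""), "alt").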
You could also use Fizzler:
html.DocumentNode.QuerySelectorAll("tr.alt a");
Note that both methods will also return anchors that aren't links.
Why not select all the links in a single query:
html.LoadHtml(page);
var nS = html.DocumentNode.SelectNodes("//tr[@class='alt']//a");
foreach(HtmlNode linkNode in nS)
{
    //do something
}
It works for HTML like this:
<table>
<tr class = "alt">
<td><a href="link.html">Some Link</a></td>
</tr>
</table>
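The "all the links or none" symptom from the question is what standard XPath produces for the two most obvious relative expressions: "a" matches only direct children of the row (none, since the links sit inside td), while "//a" starts again from the document root and matches every link. The sketch below shows the difference. It is an illustration only: it uses System.Xml.XmlDocument so it runs without extra packages, the Sample markup and the names RelativeXPathDemo and CountFromFirstRow are hypothetical, and HTML Agility Pack is assumed, not verified here, to follow the same XPath rules.

```csharp
using System.Xml;

public static class RelativeXPathDemo
{
    // Hypothetical sample: two alt rows, each with one link inside a cell.
    public const string Sample =
        "<table>" +
        "<tr class=\"alt\"><td><a href=\"one.html\">One</a></td></tr>" +
        "<tr class=\"alt\"><td><a href=\"two.html\">Two</a></td></tr>" +
        "</table>";

    // Evaluates 'relativeXPath' with the first alt row as the context node
    // and returns how many nodes it selects.
    public static int CountFromFirstRow(string relativeXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        XmlNode row = doc.SelectSingleNode("//tr[@class='alt']");
        return row.SelectNodes(relativeXPath).Count;
    }
}
```

From the first row, "a" selects 0 nodes, "//a" selects 2, and ".//a" (note the leading dot) selects the 1 link that belongs to that row; "td/a" also selects 1.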
