Using HTML Agility Pack to load all data into listboxes? - c#

I have a page with 300-something rows and want to load them all into list boxes, with each column in its own list.
I want to put the date in one box and the other two numbers in two other boxes.
HTML ex:
<table>
<tr>
<td>01/01/2017</td>
<td>100</td>
<td>500</td>
</tr>
<tr>
<td>01/02/2017</td>
<td>200</td>
<td>400</td>
</tr>
</table>
My code that pulls this:
private void LoadHTML()
{
int count = 0;
var link = @"http://example.com/data";
HtmlWeb Web = new HtmlWeb();
var htmlDoc = Web.Load(link);
var node = htmlDoc.DocumentNode.SelectNodes("//td");
foreach (var x in node)
{
count = count + 1;
if (count > 5)
{
listBox1.Items.Add(x.InnerText);
}
}
}
listBox1 gets all the data from x, since everything is a td. Selecting tr would give me each row, but then I have nothing to split the data on. My data starts after the first 5 cells, which is why I keep a count. There are headers, but I don't know how to pull the data under specific headers in this form.

First you need to get the tr nodes.
Next, iterate over them and get the td nodes of each row.
var trNodes = htmlDoc.DocumentNode.SelectNodes("//tr");
foreach (var tr in trNodes)
{
var tdNodes = tr.SelectNodes("./td");
listBox1.Items.Add(tdNodes[0].InnerText);
listBox2.Items.Add(tdNodes[1].InnerText);
listBox3.Items.Add(tdNodes[2].InnerText);
}
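One thing to watch for: the question mentions header cells, and in HtmlAgilityPack SelectNodes returns null when nothing matches, so a header row made of th cells would throw on tdNodes[0]. Below is a minimal sketch of the same loop with that guard added. It assumes the markup looks like the hypothetical inline sample, and it collects into plain lists so it can run as a console program; in the form you would call listBox1.Items.Add and so on instead.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup: one header row of <th> cells, then data rows.
        const string html = @"<table>
<tr><th>Date</th><th>First</th><th>Second</th></tr>
<tr><td>01/01/2017</td><td>100</td><td>500</td></tr>
<tr><td>01/02/2017</td><td>200</td><td>400</td></tr>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var dates = new List<string>();
        var firstNumbers = new List<string>();
        var secondNumbers = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows != null)
        {
            foreach (var tr in rows)
            {
                var cells = tr.SelectNodes("./td");

                // Skip header rows (no <td>) and rows with fewer than three cells.
                if (cells == null || cells.Count < 3) continue;

                dates.Add(cells[0].InnerText.Trim());
                firstNumbers.Add(cells[1].InnerText.Trim());
                secondNumbers.Add(cells[2].InnerText.Trim());
            }
        }

        Console.WriteLine(string.Join(", ", dates));
        Console.WriteLine(string.Join(", ", firstNumbers));
        Console.WriteLine(string.Join(", ", secondNumbers));
    }
}
```

With this guard the manual count in the question should no longer be needed, since rows without td cells are simply skipped.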

Related

Find Multiple Tables using HTML Agility Pack

I am trying to find the second table, "Team and Opponent Stats", from the following website.
https://www.basketball-reference.com/teams/BOS/2017.html
But my code only shows the first table. I've tried all kinds of XPath combinations, e.g.
"//table[@id='DataTables_Table_0']/tr/td", but nothing seems to work.
Here is my code:
var url = "https://www.basketball-reference.com/teams/BOS/2017.html";
var web = new HtmlWeb();
var doc = web.Load(url);
var table1 = doc.DocumentNode
.Descendants("tr")
.Select(n => n.Elements("td").Select(p => p.InnerText).ToArray());
foreach (string[] s in table1)
{
foreach (string str in s)
{
Console.WriteLine(str.ToString());
}
//Console.WriteLine(s);
}
foreach (var cell in doc.DocumentNode.SelectNodes("//table[@id='DataTables_Table_0']/tr/td"))
{
Console.WriteLine(cell.InnerText);
}
Here is my modified code:
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes("//table[@id=\"team_and_opponent\"]//tbody"))
{
//looping on each row, get col1 and col2 of each row
HtmlNodeCollection tds = tr.SelectNodes("td");
for (int i = 0; i < tds.Count; i++)
{
Console.WriteLine(tds[i].InnerText);
}
}
Here is the html code for the section of the website that I want to scrape.
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team_and_opponent">
<table class="suppress_all stats_table" id="team_and_opponent" data-cols-to-freeze="1"><caption>Team and Opponent Stats Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
<tr>
<th aria-label=" " data-stat="player" scope="col" class=" poptip sort_default_asc center"> </th>
<th aria-label="Games" data-stat="g" scope="col" class=" poptip sort_default_asc center" data-tip="Games">G</th>
And here is the latest Agility Pack code I'm using to get the right table.
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//*[@id=\"team_and_opponent\"]"))
{
string tempStr = table.InnerText;
foreach (HtmlNode nodecol in table.SelectNodes("//tr")) ///html/body/div[1]/div[2]/div[2]/div/div/div[3]/table[2]/tbody[2]
{
foreach (HtmlNode cell in nodecol.SelectNodes("th|td"))
{
Console.WriteLine("cell: " + cell.InnerHtml.ToString());
}
}
}
I'm still getting a NullReference error message.
That is a dynamic web page (it is manipulated by client-side JavaScript), so the content you download from the server and see in HtmlAgilityPack will not match what you ultimately see in a browser. The table actually comes back from the server inside a comment. Fortunately, the comment contains the full markup for that table, so all you really need to do is select the comment, strip the comment delimiters from the text, parse the result as HTML, then select as usual.
So if you wanted to load this into a DataTable, for instance, you could do this:
var url = "https://www.basketball-reference.com/teams/BOS/2017.html";
var web = new HtmlWeb();
var doc = web.Load(url);
var tableComment = doc.DocumentNode
.SelectSingleNode("//div[@id='all_team_and_opponent']/comment()");
var table = HtmlNode.CreateNode(tableComment.OuterHtml[4..^3])
.SelectSingleNode("//table[@id='team_and_opponent']");
var dataTable = ToDataTable(table);
DataTable ToDataTable(HtmlNode node)
{
var dt= new DataTable();
dt.BeginInit();
foreach (var col in node.SelectNodes("thead/tr/th"))
dt.Columns.Add(col.GetAttributeValue("aria-label", ""), typeof(string));
dt.EndInit();
dt.BeginLoadData();
foreach (var row in node.SelectNodes("tbody/tr"))
dt.Rows.Add(row.SelectNodes("th|td").Select(t => t.InnerText).ToArray());
dt.EndLoadData();
return dt;
}
Check the id of the second table you are looking for. Usually, Ids are meant to be unique within the DOM. So if the first table is called "DataTables_Table_0", the other table you're trying to retrieve might have an Id of "DataTables_Table_1", or something similar. Look at the page's source.
It appears that the table is originally served inside a comment and is then made visible using JavaScript.
You can use something like SelectSingleNode with the comment's XPath (//*[@id="all_team_and_opponent"]/comment()), take that node's InnerHtml, and then turn it into a regular table by removing the comment delimiters.
I made a very simple version of what you can do and uploaded it as a Gist so you can simply check my solution and integrate it into your program or test it on dotnetfiddle.net.
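A minimal sketch of those steps, assuming the served markup resembles the hypothetical, cut-down inline sample below (the real page is much larger and may have changed since). The delimiters are stripped defensively rather than by fixed offsets, and the result is parsed as a second document.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample: the table sits inside an HTML comment.
        const string html = @"<div id='all_team_and_opponent'>
<!--
<table id='team_and_opponent'>
<tbody><tr><th>Team</th><td>82</td></tr></tbody>
</table>
-->
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Step 1: select the comment node under the wrapper element.
        var comment = doc.DocumentNode
            .SelectSingleNode("//*[@id='all_team_and_opponent']/comment()");
        if (comment == null)
        {
            Console.WriteLine("Comment not found.");
            return;
        }

        // Step 2: strip the comment delimiters if they are present.
        string markup = comment.OuterHtml.Trim();
        if (markup.StartsWith("<!--", StringComparison.Ordinal))
            markup = markup.Substring(4);
        if (markup.EndsWith("-->", StringComparison.Ordinal))
            markup = markup.Substring(0, markup.Length - 3);

        // Step 3: parse the uncommented markup as its own document.
        var inner = new HtmlDocument();
        inner.LoadHtml(markup);

        // Step 4: select as usual. SelectNodes returns null when nothing matches.
        var rows = inner.DocumentNode
            .SelectNodes("//table[@id='team_and_opponent']//tr");
        if (rows == null)
        {
            Console.WriteLine("No rows found.");
            return;
        }

        foreach (var row in rows)
        {
            var cells = row.SelectNodes("th|td");
            if (cells == null) continue;
            Console.WriteLine(string.Join(" | ", cells.Select(c => c.InnerText.Trim())));
        }
    }
}
```

The null checks are there because a missing node is the usual source of the NullReferenceException mentioned in the question.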
However, if you need to run any JS file, you can use any of the following:
WebBrowser Class
Should be fairly easy for extracting text when mixed with HTML Agility Pack although it might be trickier for images or other element types. Overall it provides decent performance.
Javascript.Net
It allows you to execute scripts using Chrome's V8 JavaScript engine. You'll just have to find out which files change the content.
Selenium
You can use Selenium plus a WebDriver for your preferred browser (Chrome, Firefox, PhantomJS). It is somewhat slow but very flexible. This is probably overkill, so I recommend either of the above options.

Parse single data elements from HTML tables with C#?

I have this code in my main function and I want to parse only the first row of the table (e.g. Nov 7, 2017 73.78 74.00 72.32 72.71 17,245,947).
I created a node that contains only the first row, but when I start debugging, the node value is null. How can I parse this data and store it, for example, in a string or in individual variables? Is there a way?
WebClient web = new WebClient();
string page = web.DownloadString("https://finance.google.com/finance/historical?q=NYSE:C&ei=7O4nV9GdJcHomAG02L_wCw");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var node = doc.DocumentNode.SelectSingleNode("//*[@id=\"prices\"]/table/tbody/tr[2]");
List<List<string>> node = doc.DocumentNode.SelectSingleNode("//*[@id=\"prices\"]/table").Descendants("tr").Skip(1).Where(tr => tr.Elements("td").Count() > 1).Select(tr => tr.Elements("td").Select(td=>td.InnerText.Trim()).ToList()).ToList() ;
It seems that your XPath selection string has errors. Since tbody is a node generated by the browser, it should not be included in the path:
//*[@id=\"prices\"]/table/tr[2]
While this should read the value, HtmlAgilityPack hits another problem: malformed HTML. The <tr> and <td> nodes in the parsed text do not have corresponding </tr> or </td> closing tags, and HtmlAgilityPack fails to select values from a table with malformed rows. Therefore, it is necessary to first select the whole table:
//*[@id=\"prices\"]/table
In the next step, either sanitize the HTML by adding the </tr> and </td> closing tags and parse the corrected table again, or parse the extracted string by hand: extract lines 10 to 15 from the table string and split them on the > character. Raw parsing is shown below. The code is tested and working.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
namespace GoogleFinanceDataScraper
{
class Program
{
static void Main(string[] args)
{
WebClient web = new WebClient();
string page = web.DownloadString("https://finance.google.com/finance/historical?q=NYSE:C&ei=7O4nV9GdJcHomAG02L_wCw");
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);
var node = doc.DocumentNode.SelectSingleNode("//div[@id='prices']/table");
string outerHtml = node.OuterHtml;
List<String> data = new List<string>();
using(StringReader reader = new StringReader(outerHtml))
{
for(int i = 0; ; i++)
{
var line = reader.ReadLine();
if (i < 9) continue;
else if (i < 15)
{
var dataRawArray = line.Split(new char[] { '>' });
var value = dataRawArray[1];
data.Add(value);
}
else break;
}
}
Console.WriteLine($"{data[0]}, {data[1]}, {data[2]}, {data[3]}, {data[4]}, {data[5]}");
}
}
}

Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
//Find college
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(catalog);
var root = doc.DocumentNode;
var htmlNodes = root.DescendantsAndSelf();
// Search through fetched html nodes for relevant information
int counter = 0;
foreach (HtmlNode node in htmlNodes) {
string linkName = node.InnerText;
if (linkName == colleges[college] && counter == 0)
{
counter++;
continue;
}
else if(linkName == colleges[college] && counter == 1)
{
string targetURL = node.Attributes["href"].Value; //"found it!"; //
return targetURL;
}/* */
}
return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210
I find XPath easier than writing a lot of code:
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
.Attributes["href"].Value;
If you have 2 links with the same text, select the 2nd one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
.Attributes["href"].Value;
The LINQ version:
var links = doc.DocumentNode.Descendants("a")
.Where(a => a.InnerText == "College of Science")
.Select(a => a.Attributes["href"].Value)
.ToList();
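As for the NullReferenceException in the question: DescendantsAndSelf also returns nodes that are not a elements (text nodes, for example), and those have no href attribute, so Attributes["href"] is null. Here is a minimal sketch that avoids both the counter and the exception by searching only inside the navigation cell. It assumes the navigation td has the id acalog-navigation, as in the question; the footer markup and its id are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup: the same link text appears in the footer first,
        // then in the navigation cell.
        const string html = @"<div id='footer'><a href='/footer-link'>College of Science</a></div>
<table><tr><td id='acalog-navigation'>
<div class='n2_links'><a href='/content.php?catoid=10&amp;navoid=1210'>College of Science</a></div>
</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Restrict the search to the navigation cell so the footer link is ignored.
        var nav = doc.DocumentNode.SelectSingleNode("//td[@id='acalog-navigation']");

        // Only <a> elements are considered, so the node can carry an href.
        var anchor = nav?.Descendants("a")
            .FirstOrDefault(a => a.InnerText.Trim() == "College of Science");

        // GetAttributeValue returns the supplied default instead of throwing
        // when the attribute is missing.
        string href = anchor?.GetAttributeValue("href", string.Empty);

        if (string.IsNullOrEmpty(href))
            Console.WriteLine("Link not found.");
        else
            Console.WriteLine(HtmlEntity.DeEntitize(href)); // decode &amp; in the URL
    }
}
```

Scoping by the container is usually more robust than relying on the second match, since it keeps working if the footer moves or another copy of the link is added.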

How to Get element that inside another element by class in HtmlAgilityPack

Hello, I am using HttpWebResponse to get the HTML page with all the data I need, for example a table with dates that I need to save to an array list and then to an XML file.
Example of the HTML page:
<table>
<tr>
<td class="padding5 sorting_1">
<span>01.03.14</span>
</td>
<td class="padding5 sorting_1">
<span>10.03.14</span>
</td>
</tr>
</table>
My code, which is not working, uses HtmlAgilityPack. With it I can get the text from a span that has a class:
private static List<string> GetListDataByClass(string HtmlSourse, string Class)
{
List<string> data = new List<string>();
HtmlAgilityPack.HtmlDocument DocToParse = new HtmlAgilityPack.HtmlDocument();
DocToParse.LoadHtml(HtmlSourse);
foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes("//span[@class='" + Class + "']"))
{
if(node.InnerText!=null) data.Add(node.InnerText);
}
return data;
}
But in my case the td has the class. I tried
foreach (HtmlNode node in DocToParse.DocumentNode.SelectNodes("//td[@class='" + Class + "']"))
but this did not work.
So I need to read this data to get the dates 01.03.14 and 10.03.14. Any ideas how I can get them?
Just change the XPath query to:
DocToParse.DocumentNode.SelectNodes("//td[@class='" + Class + "']/span")
This will select all the spans that are inside a td element with the corresponding class.
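Note that @class='...' compares the whole attribute value, so it only matches if you pass the full string "padding5 sorting_1". Below is a minimal sketch that matches a single class name instead and also guards against SelectNodes returning null when nothing matches. The helper name GetSpanTextByCellClass is made up for illustration, and it assumes the class name contains no quote characters.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: returns the text of every <span> inside a <td>
    // whose class list contains cssClass (assumed to contain no quotes).
    static List<string> GetSpanTextByCellClass(string html, string cssClass)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the class list with spaces so whole class names are matched,
        // e.g. "sorting_1" but not "sorting_10".
        string xpath = "//td[contains(concat(' ', normalize-space(@class), ' '), ' "
                       + cssClass + " ')]/span";

        // SelectNodes returns null when there are no matches.
        var spans = doc.DocumentNode.SelectNodes(xpath);
        if (spans == null) return result;

        foreach (var span in spans)
            result.Add(span.InnerText.Trim());

        return result;
    }

    static void Main()
    {
        const string html = @"<table><tr>
<td class=""padding5 sorting_1""><span>01.03.14</span></td>
<td class=""padding5 sorting_1""><span>10.03.14</span></td>
</tr></table>";

        foreach (var date in GetSpanTextByCellClass(html, "sorting_1"))
            Console.WriteLine(date);
    }
}
```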

Retrieving value of element using HTMLAgility pack

I am using HtmlAgilityPack to parse HTML and then XPath to retrieve a table column with a specific class.
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("www.url.com");
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("(//td[@class='titleColumn'])[2]"))
{
Response.Write(row.InnerHtml + "<br />");
}
I retrieve the data, and row.InnerHtml looks like this:
<a>Title</a> <span>Year</span><br />
I want to save the values of the a and span elements in separate string variables. Please help.
Your XPath expression selects the second <td> that has the class titleColumn. According to the node's inner HTML, this <td> node has two child elements: <a> and <span>. So you can easily find these nodes and put their inner text (or inner HTML) into your string variables, like this:
foreach (var row in doc.DocumentNode.SelectNodes("(//td[@class='titleColumn'])[2]"))
{
var a = row.SelectSingleNode("a");
var span = row.SelectSingleNode("span");
Console.WriteLine(a.InnerText);
Console.WriteLine(span.InnerText);
}
will output:
Title
Year
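If you want the values in variables rather than printed directly, something like the following should work. It is a minimal sketch that assumes the markup resembles the hypothetical inline sample; the null-conditional operators keep it from throwing when the cell or one of its children is missing.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup: two cells with the class, the second one is wanted.
        const string html = @"<table>
<tr><td class='titleColumn'><a>Other</a> <span>2001</span><br /></td></tr>
<tr><td class='titleColumn'><a>Title</a> <span>Year</span><br /></td></tr>
</table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when nothing matches.
        var cell = doc.DocumentNode.SelectSingleNode("(//td[@class='titleColumn'])[2]");

        // Each variable stays null if the cell or the child element is absent.
        string title = cell?.SelectSingleNode("a")?.InnerText.Trim();
        string year = cell?.SelectSingleNode("span")?.InnerText.Trim();

        Console.WriteLine(title ?? "(no title)");
        Console.WriteLine(year ?? "(no year)");
    }
}
```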
