HTML Agility Pack cells merged - c#

I'm trying to pull a table from a website using the HTML Agility Pack, but I'm having a problem extracting the column data. Each row should have 6 columns. However, when I read the cells, all the column data is merged into one result.
I'm getting this:
Vintage Buff Banner665c12425
Instead of this:
Vintage Buff Banner
665c
1
24
Blank
25
Code I'm using is below:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.tf2wh.com/backpack?bp=x44rUEmREP-OCT9Kp-9w6n3GOJQJpf43YQD_dp98AvY");
var xpath = "/html/body/div[@class='page']/div[@class='main']/div[@class='specialtrade']/table[@class='data']/tbody/tr[@class='normal']";
var rows = doc.DocumentNode.SelectNodes(xpath);
foreach (HtmlNode row in rows)
{
    HtmlNodeCollection cells = row.SelectNodes("th|td");
    foreach (HtmlNode cell in cells)
    {
        Console.WriteLine("cell: " + cell.InnerText);
    }
}

I figured it out - it was bad HTML. I ran it through Tidy.NET before HTML Agility Pack, and I'm getting the results I want.

Related

how to add data in href with HTML Agility Pack

I have code that gets all 5 links from the site, and I need to change these links by putting "https://advancecare.pt" in front of each one.
For now I have this code:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("myLink");
foreach (HtmlNode ic in doc.DocumentNode.SelectNodes("//div[@class='component row-splitter']"))
{
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        HtmlNode test = doc.DocumentNode.SelectNodes("//a[@href]").First();
        string hrefValue = link.GetAttributeValue("href", string.Empty);
        // test.SetAttributeValue("href", "mylink" + hrefValue);
        link.SetAttributeValue("href", "mylink" + hrefValue);
    }
}
This code returns:
https:mylinkmylinkmylinkmylink/hrefValue
You iterate over all matching div nodes, and for each one you then iterate over all links in the whole document, so each link is processed once per div element.
Search only for links that are descendants of the current div. Note the leading dot in the XPath: without it, // starts again from the document root even when called on ic.
foreach (HtmlNode link in ic.SelectNodes(".//a[@href]"))
...
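As a minimal sketch of the difference between the absolute and the relative query, assuming the HtmlAgilityPack NuGet package is referenced. The inline markup is made up for illustration and only stands in for the real page:

using System;
using HtmlAgilityPack;

// Hypothetical markup standing in for the real page.
var doc = new HtmlDocument();
doc.LoadHtml(
    "<div class='component row-splitter'><a href='/a'>A</a></div>" +
    "<div class='component row-splitter'><a href='/b'>B</a></div>" +
    "<a href='/outside'>Outside</a>");

HtmlNode firstDiv = doc.DocumentNode
    .SelectSingleNode("//div[@class='component row-splitter']");

// "//a" restarts at the document root, so it matches every link on the page.
int absoluteCount = firstDiv.SelectNodes("//a[@href]").Count;

// ".//a" starts at the context node, so it matches only this div's links.
HtmlNodeCollection ownLinks = firstDiv.SelectNodes(".//a[@href]");
int relativeCount = ownLinks.Count;

// Each link is visited exactly once, so the prefix is added exactly once.
foreach (HtmlNode link in ownLinks)
{
    string href = link.GetAttributeValue("href", string.Empty);
    link.SetAttributeValue("href", "https://advancecare.pt" + href);
}

Console.WriteLine($"absolute: {absoluteCount}, relative: {relativeCount}");
Console.WriteLine(ownLinks[0].GetAttributeValue("href", string.Empty));

With this markup the absolute query should count 3 links and the relative one only 1, which is why the original nested loop prefixed every link several times.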

Find Multiple Tables using HTML Agility Pack

I am trying to find the second table, "Team and Opponent Stats", on the following website:
https://www.basketball-reference.com/teams/BOS/2017.html
But my code only shows the first table. I've tried all kinds of XPath combinations, e.g.
"//table[@id='DataTables_Table_0']/tr/td", but nothing seems to work.
Here is my code:
var url = "https://www.basketball-reference.com/teams/BOS/2017.html";
var web = new HtmlWeb();
var doc = web.Load(url);
var table1 = doc.DocumentNode
    .Descendants("tr")
    .Select(n => n.Elements("td").Select(p => p.InnerText).ToArray());
foreach (string[] s in table1)
{
    foreach (string str in s)
    {
        Console.WriteLine(str.ToString());
    }
    //Console.WriteLine(s);
}
foreach (var cell in doc.DocumentNode.SelectNodes("//table[@id='DataTables_Table_0']/tr/td"))
{
    Console.WriteLine(cell.InnerText);
}
Here is my modified code:
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes("//table[@id=\"team_and_opponent\"]//tbody"))
{
    //looping on each row, get col1 and col2 of each row
    HtmlNodeCollection tds = tr.SelectNodes("td");
    for (int i = 0; i < tds.Count; i++)
    {
        Console.WriteLine(tds[i].InnerText);
    }
}
Here is the html code for the section of the website that I want to scrape.
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team_and_opponent">
<table class="suppress_all stats_table" id="team_and_opponent" data-cols-to-freeze="1"><caption>Team and Opponent Stats Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
<tr>
<th aria-label=" " data-stat="player" scope="col" class=" poptip sort_default_asc center"> </th>
<th aria-label="Games" data-stat="g" scope="col" class=" poptip sort_default_asc center" data-tip="Games">G</th>
And here is the latest Agility Pack code I'm using to get the right table.
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//*[@id=\"team_and_opponent\"]"))
{
    string tempStr = table.InnerText;
    foreach (HtmlNode nodecol in table.SelectNodes("//tr")) // /html/body/div[1]/div[2]/div[2]/div/div/div[3]/table[2]/tbody[2]
    {
        foreach (HtmlNode cell in nodecol.SelectNodes("th|td"))
        {
            Console.WriteLine("cell: " + cell.InnerHtml.ToString());
        }
    }
}
I'm still getting a NullReference error message.
That is a dynamic web page (it is manipulated by client-side JavaScript), so the content you download from the server and see in HtmlAgilityPack will not match what you ultimately see in a browser. The table actually comes back from the server inside a comment. Fortunately, the comment contains the full markup for that table, so all you need to do is select the comment, strip the comment delimiters from the text, parse it as HTML, and then select as usual.
So if you wanted to load this into a data table for instance, you could do this:
using System.Data;
using System.Linq;
using HtmlAgilityPack;

var url = "https://www.basketball-reference.com/teams/BOS/2017.html";
var web = new HtmlWeb();
var doc = web.Load(url);
// The table is delivered inside an HTML comment under this div
var tableComment = doc.DocumentNode
    .SelectSingleNode("//div[@id='all_team_and_opponent']/comment()");
// Drop the leading "<!--" and trailing "-->", then parse the rest as HTML
var table = HtmlNode.CreateNode(tableComment.OuterHtml[4..^3])
    .SelectSingleNode("//table[@id='team_and_opponent']");
var dataTable = ToDataTable(table);

DataTable ToDataTable(HtmlNode node)
{
    var dt = new DataTable();
    // One string column per header cell, named after its aria-label
    dt.BeginInit();
    foreach (var col in node.SelectNodes("thead/tr/th"))
        dt.Columns.Add(col.GetAttributeValue("aria-label", ""), typeof(string));
    dt.EndInit();
    // One data row per body row, using the text of each th/td cell
    dt.BeginLoadData();
    foreach (var row in node.SelectNodes("tbody/tr"))
        dt.Rows.Add(row.SelectNodes("th|td").Select(t => t.InnerText).ToArray());
    dt.EndLoadData();
    return dt;
}
Check the id of the second table you are looking for. Usually, Ids are meant to be unique within the DOM. So if the first table is called "DataTables_Table_0", the other table you're trying to retrieve might have an Id of "DataTables_Table_1", or something similar. Look at the page's source.
It appears that the table is originally loaded as a comment and is then made visible using JavaScript.
You should use something like SelectSingleNode with the comment's XPath (//*[@id="all_team_and_opponent"]/comment()) and read the node's InnerHtml; then you just need to turn it into a visible table by removing the comment delimiters.
I made a very simple version of what you can do and uploaded it as a Gist so you can simply check my solution and integrate it into your program or test it on dotnetfiddle.net.
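The comment-unwrapping approach described above can be sketched roughly as follows. This is not the Gist itself, only an illustration under assumptions: the inline markup, the ids all_stats and stats, and the helper StripCommentDelimiters are all made up, and the HtmlAgilityPack NuGet package is assumed to be referenced. It loads the unwrapped markup into a second HtmlDocument, so leading whitespace inside the comment does not matter:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical markup: the table is shipped inside an HTML comment.
var page = new HtmlDocument();
page.LoadHtml(
    "<div id='all_stats'><!--\n" +
    "<table id='stats'><tbody><tr><th>Team</th><td>82</td></tr></tbody></table>\n" +
    "--></div>");

HtmlNode comment = page.DocumentNode
    .SelectSingleNode("//div[@id='all_stats']/comment()");
if (comment == null)
    throw new InvalidOperationException("Comment node not found.");

// Strip the comment delimiters, then parse what is left as ordinary HTML.
var fragment = new HtmlDocument();
fragment.LoadHtml(StripCommentDelimiters(comment.OuterHtml));

// SelectNodes returns null when nothing matches, so guard before iterating.
var values = new List<string>();
HtmlNodeCollection rows = fragment.DocumentNode.SelectNodes("//table[@id='stats']//tr");
if (rows != null)
{
    foreach (HtmlNode row in rows)
    {
        HtmlNodeCollection cells = row.SelectNodes("th|td");
        if (cells == null) continue;
        foreach (HtmlNode cell in cells)
            values.Add(cell.InnerText.Trim());
    }
}

Console.WriteLine(string.Join(", ", values));

// Hypothetical helper: removes a leading "<!--" and a trailing "-->" if present.
static string StripCommentDelimiters(string html)
{
    string s = html.Trim();
    if (s.StartsWith("<!--", StringComparison.Ordinal)) s = s.Substring(4);
    if (s.EndsWith("-->", StringComparison.Ordinal)) s = s.Substring(0, s.Length - 3);
    return s;
}

The null checks matter here because SelectNodes returns null rather than an empty collection when nothing matches, which is a common source of NullReferenceException in loops like the ones in the question.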
However, if you need to run any JavaScript, you can use one of the following:
WebBrowser Class
Should be fairly easy for extracting text when combined with HTML Agility Pack, although it might be trickier for images or other element types. Overall it provides decent performance.
Javascript.Net
It allows you to execute scripts using Chrome's V8 JavaScript engine. You'll just have to find out which files change the content.
Selenium
You can use Selenium plus a WebDriver for your preferred browser (Chrome, Firefox, PhantomJS). It is somewhat slow but very flexible. This is probably overkill, so I recommend either of the options above.

how do i get all the value of a table from a website

string Url = "http://www.dsebd.org/latest_share_price_scroll_l.php";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(Url);
string a = doc.DocumentNode.SelectNodes("//iframe*[@src=latest_share_price_all\"]//html/body/div/table/tbody")[0].InnerText;
I have tried this, but I get a null value for string a.
Ok this one confused me for a while but I've got it now. Instead of pulling the whole page from http://www.dsebd.org/latest_share_price_scroll_l.php, you can get just the table data from http://www.dsebd.org/latest_share_price_all.php.
There was some strange behaviour with trying to select child elements of the #document node under the iframe element. Someone with more xpath experience might be able to explain this.
Now you can get all the table row nodes by using the following xpath:
string url = "http://www.dsebd.org/latest_share_price_all.php";
HtmlDocument doc = new HtmlWeb().Load(url);
HtmlNode docNode = doc.DocumentNode;
var nodes = docNode.SelectNodes("//body/div/table/tr");
That will give you all the table row nodes. Then you need to go through each node you just got and get the values you want.
Just for example if you wanted to get the trading code, high, and volume you would do the following:
//Remove the first node because it is the header row at the top of the table
nodes.RemoveAt(0);
foreach (HtmlNode rowNode in nodes)
{
    HtmlNode tradingCodeNode = rowNode.SelectSingleNode("td[2]/a");
    string tradingCode = tradingCodeNode.InnerText;
    HtmlNode highNode = rowNode.SelectSingleNode("td[4]");
    string highValue = highNode.InnerText;
    HtmlNode volumeNode = rowNode.SelectSingleNode("td[11]");
    string volumeValue = volumeNode.InnerText;
    //Do whatever you want with the values here
    //Put them in a class or add them to a list
}
XPath uses 1-based indices so when you are referring to a particular cell in a table row by number the first element is at index 1, instead of using index 0 as in a C# array.
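A small sketch of the indexing difference, assuming the HtmlAgilityPack NuGet package is referenced. The one-row table is made up for illustration:

using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<table><tr><td>first</td><td>second</td><td>third</td></tr></table>");

HtmlNode row = doc.DocumentNode.SelectSingleNode("//tr");

// XPath positions start at 1 ...
string viaXPath = row.SelectSingleNode("td[1]").InnerText;

// ... while the C# collection returned by SelectNodes starts at 0.
string viaIndexer = row.SelectNodes("td")[0].InnerText;

// No element has position 0, so td[0] matches nothing and yields null.
bool zeroIsNull = row.SelectSingleNode("td[0]") == null;

Console.WriteLine($"{viaXPath} / {viaIndexer} / td[0] is null: {zeroIsNull}");

Both lookups refer to the same first cell; an off-by-one in the XPath predicate either picks the neighbouring column or, for td[0], returns null.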

C# count paragraphs in div from a website's html source code

Using Html Agility Pack, I have been trying to count the number of paragraph tags in each div tag and get the id and class (if they exist) of the div that has the most paragraphs, but I'm having trouble with the syntax.
My code looks like this:
// HtmlDocument is stored in doc
HtmlAgilityPack.HtmlNodeCollection div = doc.DocumentNode.SelectNodes("//div");
foreach (HtmlAgilityPack.HtmlNode divNode in div)
{
    var x = divNode.DescendantNodes("p").Count; // doesn't actually work
    // x should also be stored in a list
}
If anyone could point me to right direction or provide me with examples, it would really help. Thanks!
How about this:
// get the maximum number of paragraphs in any div
int maxNumberOfParagraph =
    doc.DocumentNode
       .SelectNodes("//div[.//p]")
       .Max(o => o.SelectNodes(".//p").Count);
// get the divs whose paragraph count equals maxNumberOfParagraph
var divs = doc.DocumentNode
    .SelectNodes("//div[.//p]")
    .Where(o => o.SelectNodes(".//p").Count == maxNumberOfParagraph);
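As an alternative sketch that also returns the id and class asked for in the question, the same count can be done with Descendants and LINQ instead of XPath. This assumes the HtmlAgilityPack NuGet package is referenced; the markup and the ParagraphCount property name are made up for illustration:

using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div id='a' class='intro'><p>1</p></div>" +
    "<div id='b' class='body'><p>1</p><p>2</p><p>3</p></div>" +
    "<div id='c'></div>");

// Count the <p> descendants of every div and keep the div with the most.
var best = doc.DocumentNode
    .Descendants("div")
    .Select(div => new
    {
        Id = div.GetAttributeValue("id", string.Empty),       // empty if missing
        Class = div.GetAttributeValue("class", string.Empty), // empty if missing
        ParagraphCount = div.Descendants("p").Count()
    })
    .OrderByDescending(d => d.ParagraphCount)
    .FirstOrDefault(); // null when the page has no div at all

if (best != null)
    Console.WriteLine($"id={best.Id} class={best.Class} paragraphs={best.ParagraphCount}");

Descendants returns an empty sequence rather than null when nothing matches, so this version needs no guard for divs without paragraphs.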

Retrieving value of element using HTMLAgility pack

I am using HTMLAgility pack to parse html and then using xpath retrieve a table column with a specific class.
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("www.url.com");
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("(//td[@class='titleColumn'])[2]"))
{
    Response.Write(row.InnerHtml + "<br />");
}
I retrieve the data, and row.InnerHtml looks like this:
<a>Title</a> <span>Year</span><br />
I want to save the value of a and span element in separate string variables. Please help
Your XPath expression selects the second <td> that has the class titleColumn. According to the node's inner HTML, this <td> node has two child nodes: <a> and <span>. So you can easily find these nodes and then put their inner text (or inner HTML) into your string variables. For example:
foreach (var row in doc.DocumentNode.SelectNodes("(//td[@class='titleColumn'])[2]"))
{
    var a = row.SelectSingleNode("a");
    var span = row.SelectSingleNode("span");
    Console.WriteLine(a.InnerText);
    Console.WriteLine(span.InnerText);
}
will output:
Title
Year
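If the cell or one of its children might be missing, a null-tolerant variant that stores the two values in separate strings could look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the table markup is made up, and the variable names title and year are only illustrative:

using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<table>" +
    "<tr><td class='titleColumn'><a>First</a> <span>2001</span></td></tr>" +
    "<tr><td class='titleColumn'><a>Title</a> <span>Year</span></td></tr>" +
    "</table>");

// The parenthesised form picks the second matching td in the whole document.
HtmlNode cell = doc.DocumentNode.SelectSingleNode("(//td[@class='titleColumn'])[2]");

// SelectSingleNode returns null when nothing matches, hence the ?. and ?? fallbacks.
string title = cell?.SelectSingleNode("a")?.InnerText ?? string.Empty;
string year = cell?.SelectSingleNode("span")?.InnerText ?? string.Empty;

Console.WriteLine(title);
Console.WriteLine(year);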
