Get Links in class with html agility pack - c#

There are a bunch of tr's with the class alt. I want to get all the links (or the first or last), yet I can't figure out how with Html Agility Pack.
I tried variants of a, but I only get all the links or none. It doesn't seem to get only the ones in the node, which makes no sense since I am writing n.SelectNodes.
html.LoadHtml(page);
var nS = html.DocumentNode.SelectNodes("//tr[@class='alt']");
foreach (var n in nS)
{
var aS = n.SelectNodes("a");
...
}

You can use LINQ:
var links = html.DocumentNode
.Descendants("tr")
.Where(tr => tr.GetAttributeValue("class", "").Contains("alt"))
.SelectMany(tr => tr.Descendants("a"))
.ToArray();
Note that this will also match <tr class="Malto">; you may want to replace the Contains call with a regex.
You could also use Fizzler:
html.DocumentNode.QuerySelectorAll("tr.alt a");
Note that both methods will also return anchors that aren't links.
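The two notes above (substring matches such as "Malto", and anchors without an href) can be handled together. The following is only a sketch: it swaps the suggested regex for a plain split of the class attribute into whitespace-separated tokens, and the names AltRowLinks, HasClass and GetHrefs are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class AltRowLinks
{
    // True when the whitespace-separated class attribute contains the exact token,
    // so class="alt" and class="row alt" match but class="Malto" does not.
    public static bool HasClass(HtmlNode node, string className) =>
        node.GetAttributeValue("class", "")
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Contains(className);

    // Returns the href values of anchors inside <tr class="alt"> rows,
    // skipping anchors that have no href (named anchors, for example).
    public static string[] GetHrefs(string page)
    {
        var html = new HtmlDocument();
        html.LoadHtml(page);
        return html.DocumentNode
            .Descendants("tr")
            .Where(tr => HasClass(tr, "alt"))
            .SelectMany(tr => tr.Descendants("a"))
            .Select(a => a.GetAttributeValue("href", ""))
            .Where(href => href != "")
            .ToArray();
    }
}
```

Calling AltRowLinks.GetHrefs(page) with the downloaded markup should then give only real link targets from the alt rows.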

Why not select all links in a single query:
html.LoadHtml(page);
var nS = html.DocumentNode.SelectNodes("//tr[@class='alt']//a");
foreach(HtmlNode linkNode in nS)
{
//do something
}
This works for HTML such as:
<table>
<tr class = "alt">
<td><a href="link.html">Some Link</a></td>
</tr>
</table>

Related

Find Multiple Tables using HTML Agility Pack

I am trying to find the second table, "Team and Opponent Stats", on the following website.
https://www.basketball-reference.com/teams/BOS/2017.html
But my code only shows the first table. I've tried all kinds of XPath combinations, e.g.
"//table[@id='DataTables_Table_0']/tr/td", but nothing seems to work.
Here is my code:
var url = "https://www.basketball-reference.com/teams/BOS/2017.html";
var web = new HtmlWeb();
var doc = web.Load(url);
var table1 = doc.DocumentNode
.Descendants("tr")
.Select(n => n.Elements("td").Select(p => p.InnerText).ToArray());
foreach (string[] s in table1)
{
foreach (string str in s)
{
Console.WriteLine(str.ToString());
}
//Console.WriteLine(s);
}
foreach (var cell in doc.DocumentNode.SelectNodes("//table[@id='DataTables_Table_0']/tr/td"))
{
Console.WriteLine(cell.InnerText);
}
Here is my modified code:
foreach (HtmlNode tr in doc.DocumentNode.SelectNodes("//table[@id=\"team_and_opponent\"]//tbody"))
{
//looping on each row, get col1 and col2 of each row
HtmlNodeCollection tds = tr.SelectNodes("td");
for (int i = 0; i < tds.Count; i++)
{
Console.WriteLine(tds[i].InnerText);
}
}
Here is the html code for the section of the website that I want to scrape.
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team_and_opponent">
<table class="suppress_all stats_table" id="team_and_opponent" data-cols-to-freeze="1"><caption>Team and Opponent Stats Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
<tr>
<th aria-label=" " data-stat="player" scope="col" class=" poptip sort_default_asc center"> </th>
<th aria-label="Games" data-stat="g" scope="col" class=" poptip sort_default_asc center" data-tip="Games">G</th>
And here is the latest Agility Pack code I'm using to get the right table.
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//*[@id=\"team_and_opponent\"]"))
{
string tempStr = table.InnerText;
foreach (HtmlNode nodecol in table.SelectNodes("//tr")) ///html/body/div[1]/div[2]/div[2]/div/div/div[3]/table[2]/tbody[2]
{
foreach (HtmlNode cell in nodecol.SelectNodes("th|td"))
{
Console.WriteLine("cell: " + cell.InnerHtml.ToString());
I'm still getting a NullReference error message.
That is a dynamic web page (it is manipulated by client-side JavaScript), so the content you download from the server and see in HtmlAgilityPack will not match what you ultimately see in a browser. The table actually comes back from the server inside a comment. Fortunately, the comment holds the full markup for that table, so all you need to do is select the comment, strip the comment delimiters from the text, parse the remainder as HTML, and then select as usual.
So if you wanted to load this into a data table for instance, you could do this:
var url = "https://www.basketball-reference.com/teams/BOS/2017.html";
var web = new HtmlWeb();
var doc = web.Load(url);
var tableComment = doc.DocumentNode
.SelectSingleNode("//div[@id='all_team_and_opponent']/comment()");
var table = HtmlNode.CreateNode(tableComment.OuterHtml[4..^3])
.SelectSingleNode("//table[@id='team_and_opponent']");
var dataTable = ToDataTable(table);
DataTable ToDataTable(HtmlNode node)
{
var dt= new DataTable();
dt.BeginInit();
foreach (var col in node.SelectNodes("thead/tr/th"))
dt.Columns.Add(col.GetAttributeValue("aria-label", ""), typeof(string));
dt.EndInit();
dt.BeginLoadData();
foreach (var row in node.SelectNodes("tbody/tr"))
dt.Rows.Add(row.SelectNodes("th|td").Select(t => t.InnerText).ToArray());
dt.EndLoadData();
return dt;
}
Check the id of the second table you are looking for. Usually, Ids are meant to be unique within the DOM. So if the first table is called "DataTables_Table_0", the other table you're trying to retrieve might have an Id of "DataTables_Table_1", or something similar. Look at the page's source.
It appears that the table is originally loaded as a comment and is then made visible using JavaScript.
You should use something like SelectSingleNode with the comment's XPath (//*[@id="all_team_and_opponent"]/comment()) and read the node's InnerHtml; then you just need to turn it into a visible table by removing the comment delimiters.
I made a very simple version of what you can do and uploaded it as a Gist so you can simply check my solution and integrate it into your program or test it on dotnetfiddle.net.
However, if you need to run any JS file, you can use any of the following:
WebBrowser Class
Should be fairly easy for extracting text when mixed with HTML Agility Pack, although it might be trickier for images or other element types. Overall it provides decent performance.
Javascript.Net
It allows you to execute scripts using Chrome's V8 JavaScript engine. You'll just have to find out which files change the content.
Selenium
You can use Selenium plus a webdriver for your preferred browser (Chrome, Firefox, PhantomJS). It is somewhat slow but very flexible. This is probably overkill, so I recommend either of the above options.

Get links with specific words from a HTML code in C#

I am trying to parse a website. I need the links in an HTML file that contain some specific words. I know how to find "href" attributes, but I don't need all of them. Is there any way to do that? For example, can I use regex in HtmlAgilityPack?
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
this.dgvurl.Rows.Add(urls.Attributes["href"].Value);
}
I'm trying this for finding all links in HTML code.
If you have an HTML file like this:
<div class="a">
</div>
And you're searching for, say, the words theword and other, you can define a regular expression and then use LINQ to get the links whose href attribute matches it, like this:
Regex regex = new Regex("(theword|other)", RegexOptions.IgnoreCase);
HtmlNode node = htmlDoc.DocumentNode.SelectSingleNode("//div[@class='a']");
List<HtmlNode> nodeList = node.SelectNodes(".//a").Where(a => regex.IsMatch(a.GetAttributeValue("href", ""))).ToList();
List<string> urls = new List<string>();
foreach (HtmlNode n in nodeList)
{
urls.Add(n.Attributes["href"].Value);
}
Note that XPath has a contains function, but you'll have to repeat the condition for each word you're searching for, like this:
node.SelectNodes(".//a[contains(@href,'theword') or contains(@href,'other')]")
There's also a matches function in XPath; unfortunately it's only available in XPath 2.0, and HtmlAgilityPack uses XPath 1.0. With XPath 2.0, you could do something like this:
node.SelectNodes(".//a[matches(@href,'(theword|other)')]")
I found this, and it works for me.
HtmlNode links = document.DocumentNode.SelectSingleNode("//*[@id='navigation']/div/ul");
foreach (HtmlNode urls in document.DocumentNode.SelectNodes("//a[@href]"))
{
var temp = urls.Attributes["href"].Value;
if (temp.Contains("some_word"))
{
dgv.Rows.Add(temp);
}
}

How to parse this HTML text using htmlagilitypack?

So below are the lines of code,
<td class="line1left">SCN02_MS_AddNotes_CAM</td><td class="line1left">798 (6.14%)
</td><td class="line1left">0.9</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
<td class="line1left">SCN05_MS_UpdateCustomer_CAM</td><td class="line1left">888 (6.83%)
</td><td class="line1left">1.0</td><td class="line1left">0s (<span> - %</span>)
</td><td class="line1left">0% (<span class="goodPercentage">-100%</span>)
</td>
From the first block, I need to get SCN02_MS_AddNotes_CAM and 798. To get 798 I am using this code, but I am getting the (6.14%) also, which I don't want.
var content1 = doc1.DocumentNode.SelectNodes("//td[@class='line1left']")[1].InnerText;
I want to get 798 only. So can anybody help me?
I also want to know how to get the same values from the second block. I was under the impression that the number inside the brackets represents the different occurrences of the class line1left. But here it is representing the different InnerHtml elements.
[1]
Does anybody know how to get this to work?
Thanks a lot in advance!
var line1left_list = (from d in document.DocumentNode.Descendants()
where d.Name == "td" && d.Attributes["class"] != null
&& (d.Attributes["class"].Value == "line1left")
select d);
foreach (HtmlNode line1left in line1left_list)
{
var _link = line1left.Descendants("a").FirstOrDefault();
string linkUrl = "";
string link = "";
if (_link != null)
{
linkUrl = _link.Attributes["href"].Value;
link = _link.InnerText;
}
}
It looks like you want the InnerText of all <td> tags with the class attribute of "line1left", unless that <td> has an <a> inside of it, in which case you want the InnerText of <a>.
Here is an example that will do just that. If the <td> has an <a>, then <a> is selected, otherwise <td> is selected.
HtmlDocument doc1 = new HtmlDocument();
doc1.Load("xmlfile2.xml");
var nodes = doc1.DocumentNode.SelectNodes("(//td[@class='line1left']/a) | (//td[@class='line1left' and not(a)])");
foreach(var node in nodes)
Console.WriteLine(node.InnerText.Trim());
This will select all the matching nodes in the document. You can use regular C# code to strip off the unwanted formatting on the individual values.
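As a rough sketch of the "regular C# code" step, assuming the unwanted part always starts at the first opening parenthesis (as in "798 (6.14%)"), the value in front of it can be cut out with plain string operations. The names CellText and LeadingValue are hypothetical.

```csharp
using System;

public static class CellText
{
    // Returns the text before the first '(' with surrounding whitespace removed.
    // If the cell has no parenthesis, the whole trimmed text is returned.
    public static string LeadingValue(string innerText)
    {
        int paren = innerText.IndexOf('(');
        string head = paren >= 0 ? innerText.Substring(0, paren) : innerText;
        return head.Trim();
    }
}
```

Inside the loop above, Console.WriteLine(CellText.LeadingValue(node.InnerText)) would then print 798 for the second cell and leave SCN02_MS_AddNotes_CAM unchanged.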

Retrieve all the ids for a given sentence using regex in c#

I am working on a .NET (C#) application which gets and processes an HTML file. I need to get the ids of the HTML elements from that file, and I want to use a regular expression for that. I've tried some combinations, but with no luck.
For example, if I have the line:
<a href="#" id="thisAnchor" >Link to somewhere</a><div id="divToCollect">BigDiv</div>
I want to get: thisAnchor and divToCollect. I am using Regex:
Regex.Matches(currentLine, expression);
You should not use regex for that, use HtmlAgilityPack and you will have no problems getting all the attributes you need:
string html = "<div id='divid'></div><a id='ancorid'></a>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var divIds = doc.DocumentNode
.Descendants("div")
.Where(div => div.Attributes["id"] != null)
.Select(div => div.Attributes["id"].Value)
.ToList();

C# parse html with xpath

I'm trying to parse out stock exchange information with a simple piece of C# from an HTML document. The problem is that I cannot get my head around the syntax: the tr class="LomakeTaustaVari" gets parsed out, but how do I get the second bit that has no tr class?
Here's a piece of the HTML; it repeats itself with different values.
<tr class="LomakeTaustaVari">
<td><div class="Ensimmainen">12:09</div></td>
<td><div>MSI</div></td>
<td><div>POH</div></td>
<td><div>42</div></td>
<td><div>64,50</div></td>
</tr>
<tr>
<td><div class="Ensimmainen">12:09</div></td>
<td><div>SRE</div></td>
<td><div>POH</div></td>
<td><div>156</div></td>
<td><div>64,50</div></td>
</tr>
My C# code:
{
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load ("https://www.op.fi/op/henkiloasiakkaat/saastot-ja-sijoitukset/kurssit-ja-markkinat/markkinat?sivu=alltrades.html&sym=KNEBV.HSE&from=10:00&to=19:00&id=32453");
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr[@class='LomakeTaustaVari']"))
{
Console.WriteLine(row.InnerText);
}
Console.ReadKey();
}
Try the following XPath, //tr[preceding-sibling::tr[@class='LomakeTaustaVari']]:
var nodes = doc.DocumentNode.SelectNodes("//tr[preceding-sibling::tr[@class='LomakeTaustaVari']]");
It should select every tr that has a preceding sibling tr with the class LomakeTaustaVari.
Just FYI: if no nodes found, SelectNodes method returns null.
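Because of that null return, a foreach directly over SelectNodes throws a NullReferenceException when nothing matches. One way to guard against it, shown here only as a sketch with a made-up helper name (SelectNodesOrEmpty), is to substitute an empty sequence:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathHelpers
{
    // Hypothetical helper: wraps SelectNodes so callers get an empty
    // sequence instead of null when the XPath matches nothing.
    public static IEnumerable<HtmlNode> SelectNodesOrEmpty(HtmlNode node, string xpath)
    {
        return (IEnumerable<HtmlNode>)node.SelectNodes(xpath) ?? Enumerable.Empty<HtmlNode>();
    }
}
```

A loop such as foreach (var row in XPathHelpers.SelectNodesOrEmpty(doc.DocumentNode, "//tr")) then simply runs zero times when there is no match.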
If you manage to get a reference to the <tr class="LomakeTaustaVari"> element, I see two possible solutions.
You can navigate to the parent and then find all its <tr> children:
lomakeTaustaVariElement.Parent.SelectNodes("tr"); // iterate over these if needed
You can also use NextSibling to get the next <tr>:
var trWithoutClass = lomakeTaustaVariElement.NextSibling;
Please note that using the second alternative you may run into issues, because whitespace present in the HTML may be interpreted as being a distinct element.
To overcome this, you may recursively call NextSibling until you encounter a tr element.
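A minimal sketch of that idea, written as a loop rather than recursion, could look like the following. The helper name NextElementSibling is an assumption for illustration; it relies on Html Agility Pack's NextSibling, NodeType and Name members.

```csharp
using HtmlAgilityPack;

public static class SiblingHelpers
{
    // Walks forward through the siblings of 'node', skipping whitespace
    // text nodes and comments, and returns the first element with the
    // given (lowercase) tag name, or null if there is none.
    public static HtmlNode NextElementSibling(HtmlNode node, string name)
    {
        for (var sibling = node.NextSibling; sibling != null; sibling = sibling.NextSibling)
        {
            if (sibling.NodeType == HtmlNodeType.Element && sibling.Name == name)
                return sibling;
        }
        return null;
    }
}
```

With the element from above, SiblingHelpers.NextElementSibling(lomakeTaustaVariElement, "tr") would give the class-less row that follows it.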
This will iterate over all tr nodes in the document. You will probably also need to be more specific about the starting node, so that you select only the rows you are interested in.
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("//tr"))
{
Console.WriteLine(row.InnerText);
}
Probably I don't understand something, but the simplest XPath for any tr element selection should do the work:
doc.DocumentNode.SelectNodes("//tr")
Otherwise, in case you would like to select elements with specific class attributes only, it could be:
doc.DocumentNode.SelectNodes("//tr[@class = 'someClass1' or @class = 'someClass2']")
If you do not want to load the page and would rather use a ready HTML string, e.g. from a WebBrowser element, you can use the following example:
var web = new HtmlAgilityPack.HtmlDocument();
web.LoadHtml(webBrowser1.Document.Body.Parent.OuterHtml);
var q = web.DocumentNode.SelectNodes("/html/body/div[2]/div/div[1]"); //XPath /html/body/div[2]/div/div[1]
