Unable to display contents of node from XML online - c#

I have an XML document online, and I'm trying to obtain the value of a particular node and display it in a label.
XML:
<h1 class="contents-header">Upper Limit Management</h1>
<div class="contents-body">
<table class="data-table" width="500px">
<tr>
<th scope="col" width="165px">Max. Allowance Set</th>
<th scope="col" width="165px">Max. Allowance</th>
<th scope="col" width="170px">Meter Count</th>
</tr>
<tr>
<td>Disable</td>
<td>0</td>
<td>32</td>
</tr>
</table>
</div>
<hr class="contents-end" title="">
</div>
And here's the C# code I've tried to obtain and display it. I'm trying to display the node "Meter Count" which has a value of 32.
private void loadxmlbtn_Click(object sender, EventArgs e)
{
string myXmlString = new WebClient().DownloadString("http://10.86.192.24/system.xml");
XmlDocument xml = new XmlDocument();
xml.LoadXml(myXmlString);
XmlNodeList xnList = xml.SelectNodes("/Upper Limit Management/Meter Count");
foreach (XmlNode xn in xnList)
{
string count = xn["td"].InnerText;
countlbl.Text = "Your current count is: " + count;
}
}
However, on the button click, nothing is happening. Is anyone able to point out what might be going wrong?

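HTML is not the same as XML. The markup you posted is HTML (for example, the <hr> tag is never closed), so an XML parser cannot load it; you will have to use an HTML parser such as HtmlAgilityPack instead.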
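To see why the XML parser is the wrong tool here, the following minimal sketch feeds a small fragment to XmlDocument.LoadXml. The fragment string and the variable names are illustrative assumptions; only the unclosed <hr> tag is taken from the question's markup.

```csharp
using System;
using System.Xml;

// Illustrative fragment modelled on the question's markup: <hr> is never closed,
// which is legal HTML but not well-formed XML.
string fragment =
    "<div><table><tr><td>32</td></tr></table><hr class=\"contents-end\" title=\"\"></div>";

var xml = new XmlDocument();
bool parsedAsXml;
try
{
    xml.LoadXml(fragment);
    parsedAsXml = true;
}
catch (XmlException ex)
{
    // The parser rejects the mismatched <hr> ... </div> pair instead of building a document.
    parsedAsXml = false;
    Console.WriteLine("Not well-formed XML: " + ex.Message);
}

Console.WriteLine(parsedAsXml);
```

If an exception like this is swallowed somewhere in the click handler, the label is never updated, which would look like "nothing is happening".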

Related

How can I find correct XPath

I want to scrape data from a website, but I'm getting the following error.
string testurl = "https://www.basketball-reference.com/boxscores/202110190MIL.html";
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(testurl);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("//*[@id=\"div_four_factors\"]");
int nodeCt = htmlNodes.Count;
Here is the error that I'm getting:
System.NullReferenceException: 'Object reference not set to an instance of an object.'
This is a snippet from the Page Source:
<div class="table_container" id="div_four_factors">
<table class="suppress_all stats_table" id="four_factors" data-cols-to-freeze=",1">
<caption>Four Factors Table</caption>
<tr class="over_header">
<th aria-label="" data-stat="" colspan="2" class=" over_header center" ></th>
<th aria-label="" data-stat="header_tmp" colspan="4" class=" over_header center" >Four Factors</th><th></th>
</tr>
<tr>
<th aria-label=" " data-stat="team_id" scope="col" class=" poptip sort_default_asc left" data-tip="Team" > </th>
<th aria-label="Pace Factor" data-stat="pace" scope="col" class=" poptip right" data-tip="<b>Pace Factor</b>: An estimate of possessions per 48 minutes" >Pace</th>
<th aria-label="Effective Field Goal Percentage" data-stat="efg_pct" scope="col" class=" poptip right" data-tip="<strong>Effective Field Goal Percentage</strong><br>This statistic adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal." data-over-header="Four Factors" >eFG%</th>
<th aria-label="Turnover Percentage" data-stat="tov_pct" scope="col" class=" poptip sort_default_asc right" data-tip="<b>Turnover Percentage</b><br>An estimate of turnovers committed per 100 plays." data-over-header="Four Factors" >TOV%</th>
<th aria-label="Offensive Rebound Percentage" data-stat="orb_pct" scope="col" class=" poptip right" data-tip="<b>Offensive Rebound Percentage</b><br>An estimate of the percentage of available offensive rebounds a player grabbed while they were on the floor." data-over-header="Four Factors" >ORB%</th>
<th aria-label="FT/FGA" data-stat="ft_rate" scope="col" class=" poptip right" data-tip="Free Throws Per Field Goal Attempt" data-over-header="Four Factors" >FT/FGA</th>
<th aria-label="Offensive Rating" data-stat="off_rtg" scope="col" class=" poptip right" data-tip="<b>Offensive Rating</b><br>An estimate of points produced (players) or scored (teams) per 100 possessions" >ORtg</th>
</tr>
</thead>
BRK 101.8 .542 11.3 10.9 .155 102.1
MIL 101.8 .538 5.8 25.0 .133 124.7
Since the element you're looking for is within a comment block in the page source (i.e. not the rendered HTML) it's a little more tricky.
First you need to find the comment, get the contents of it and parse it as HTML:
string testurl = "https://www.basketball-reference.com/boxscores/202110190MIL.html";
var web = new HtmlWeb();
var htmlDoc = web.Load(testurl);
var comment = htmlDoc.DocumentNode
// Get all comments.
// If you know where the comment is you can optimize this XPath:
.SelectNodes("//comment()")
// Only get the one(s) with the element you're looking for
.Where(c => c.OuterHtml.Contains("div_four_factors"))
.First(); // In this example I assume there's only one
// Remove comment characters: "<!--" and "-->"
var commentContent = comment.OuterHtml[4..^3];
// Load the HTML within the comment
var commentDoc = new HtmlDocument();
commentDoc.LoadHtml(commentContent);
// Find your node
var htmlNodes = commentDoc.DocumentNode.SelectNodes("//*[@id=\"div_four_factors\"]");
Using the LoadFromBrowser also works. Here is the code I am using.
//***********Load FRom Browser
string[] result1 = new string[] { };
HtmlWeb web1 = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc1 = new HtmlAgilityPack.HtmlDocument();
doc1 = web1.LoadFromBrowser(url, html =>
{
// Wait for the HTML element to exist
return !html.Contains("//*[@id=\"div_four_factors\"]");
});
var _extractText = doc1.DocumentNode.SelectSingleNode("//*[@id=\"div_four_factors\"]").InnerText;
Console.WriteLine(_extractText);

Get HTML values from web response

I am trying to parse an HTML response for a couple of values and then insert them into SQL. I am able to get both values but, because the code is wrapped in a foreach statement, I get them twice.
Here is my HTML response
<div align="CENTER" class='dataTitle'>Host State Breakdowns:</div>
<p align='center'>
<a href='trends.cgi?host=hostname&includesoftstates=no&assumeinitialstates=yes&initialassumedhoststate=0&backtrack=4'><img src='trends.cgi?createimage&host=hostname&includesoftstates=no&initialassumedhoststate=0&backtrack=4' border="1" alt='Host State Trends' title='Host State Trends' width='500' height='20'></a><br>
</p>
<div align="CENTER">
<table border="0" class='data'>
<tr><th class='data'>State</th><th class='data'>Type / Reason</th><th class='data'>Time</th><th class='data'>% Total Time</th><th class='data'>% Known Time</th></tr>
<tr class='dataEven'><td class='hostUP' rowspan="3">UP</td><td class='dataEven'>Unscheduled</td><td class='dataEven'>0d 10h 5m 19s</td><td class='dataEven'>100.000%</td><td class='dataEven'>100.000%</td></tr>
<tr class='dataEven'><td class='dataEven'>Scheduled</td><td class='dataEven'>0d 0h 0m 0s</td><td class='dataEven'>0.000%</td><td class='dataEven'>0.000%</td></tr>
<tr class='hostUNREACHABLE'><td class='hostUNREACHABLE'>Total</td><td class='hostUNREACHABLE'>0d 0h 0m 0s</td><td class='hostUNREACHABLE'>0.000%</td><td class='hostUNREACHABLE'>0.000%</td></tr>
<tr class='dataOdd'><td class='dataOdd' rowspan="3">Undetermined</td><td class='dataOdd'>Nagios Not Running</td><td class='dataOdd'>0d 0h 0m 0s</td><td class='dataOdd'>0.000%</td><td class='dataOdd'></td></tr>
<tr class='dataOdd'><td class='dataOdd'>Insufficient Data</td><td class='dataOdd'>0d 0h 0m 0s</td><td class='dataOdd'>0.000%</td><td class='dataOdd'></td></tr>
<tr class='dataOdd'><td class='dataOdd'>Total</td><td class='dataOdd'>0d 0h 0m 0s</td><td class='dataOdd'>0.000%</td><td class='dataOdd'></td></tr>
<tr><td colspan="3"></td></tr>
<tr class='dataEven'><td class='dataEven'>All</td><td class='dataEven'>Total</td><td class='dataEven'>0d 10h 5m 19s</td><td class='dataEven'>100.000%</td><td class='dataEven'>100.000%</td></tr>
</table>
</div>
<br><br>
<div align="CENTER" class='dataTitle'>State Breakdowns For Host Services:</div>
<div align="CENTER">
<table border="0" class='data'>
<tr><th class='data'>Service</th><th class='data'>% Time OK</th><th class='data'>% Time Warning</th><th class='data'>% Time Unknown</th><th class='data'>% Time Critical</th><th class='data'>% Time Undetermined</th></tr>
<tr class='dataOdd'><td class='dataOdd'><a href='avail.cgi?host=hostname&service=servicename&t1=1478498400&t2=1478534719&backtrack=4&assumestateretention=yes&assumeinitialstates=yes&assumestatesduringnotrunning=yes&initialassumedhoststate=0&initialassumedservicestate=0&show_log_entries&showscheduleddowntime=yes&rpttimeperiod=24x7'>servicename</a></td><td class='serviceOK'>100.000% (100.000%)</td><td class='serviceWARNING'>0.000% (0.000%)</td><td class='serviceUNKNOWN'>0.000% (0.000%)</td><td class='serviceCRITICAL'>0.000% (0.000%)</td><td class='dataOdd'>0.000%</td></tr>
<tr class='dataEven'><td class='dataEven'><a href='avail.cgi?host=hostname&service=servicename2&t1=1478498400&t2=1478534719&backtrack=4&assumestateretention=yes&assumeinitialstates=yes&assumestatesduringnotrunning=yes&initialassumedhoststate=0&initialassumedservicestate=0&show_log_entries&showscheduleddowntime=yes&rpttimeperiod=24x7'>servicename2</a></td><td class='serviceOK'>100.000% (100.000%)</td><td class='serviceWARNING'>0.000% (0.000%)</td><td class='serviceUNKNOWN'>0.000% (0.000%)</td><td class='serviceCRITICAL'>0.000% (0.000%)</td><td class='dataEven'>0.000%</td></tr>
</table>
</div>
Here is my code:
var response = (HttpWebResponse)request.GetResponse();
var stream = response.GetResponseStream();
HtmlDocument doc = new HtmlDocument();
doc.Load(stream);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//table[@class]"))
{
foreach (HtmlNode node2 in node.SelectNodes("//td[@class = 'serviceOK']"))
{
var value = node2.InnerText;
}
foreach (HtmlNode node3 in node.SelectNodes("//a[contains(@href, 'avail.cgi')]"))
{
var name = node3.InnerText;
}
}
name shows the servicename and value shows the class serviceOK but it repeats itself again because of the first foreach.
My results look like this:
100.000% (100.000%)
100.000% (100.000%)
servicename
servicename2
100.000% (100.000%)
100.000% (100.000%)
servicename
servicename2
Is there a way to, first, match the values up, and two, only have them show once?
Your first foreach traverses the entire document as do both of your other foreach statements inside of the first.
Because there are 2 table elements matching your XPath expression
"//table[@class]"
you are getting your answer twice. If you had more table elements matching your XPath expression, say 7 for example, you would get the result 7 times.
What you want is to find all table divisions (td) with class "serviceOK" that are within a table row (tr) within a table.
Once you have this HtmlNode you can just go to the previous sibling which will contain the service name.
var response = (HttpWebResponse)request.GetResponse();
var stream = response.GetResponseStream();
HtmlDocument doc = new HtmlDocument();
doc.Load(stream);
foreach (HtmlNode serviceOkNode in doc.DocumentNode.SelectNodes("//table[@class]/tr/td[@class = 'serviceOK']"))
{
HtmlNode serviceNameNode = serviceOkNode.PreviousSibling;
var value = serviceOkNode.InnerText;
var name = serviceNameNode.InnerText;
}
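The point about each inner query traversing the entire document can be checked in isolation. The sketch below uses System.Xml instead of HtmlAgilityPack so that it runs without extra packages; the tiny two-table document and the variable names are assumptions made for illustration. It compares an expression starting with // against one starting with .// when both are evaluated on the same node.

```csharp
using System;
using System.Xml;

// Two tables with one cell each, standing in for the two tables in the question.
var doc = new XmlDocument();
doc.LoadXml("<root><table><tr><td>a</td></tr></table><table><tr><td>b</td></tr></table></root>");

XmlNode firstTable = doc.SelectSingleNode("/root/table[1]");

// "//td" starts at the document root, even when it is called on firstTable.
int absoluteCount = firstTable.SelectNodes("//td").Count;

// ".//td" starts at the context node, so only firstTable's own cell matches.
int relativeCount = firstTable.SelectNodes(".//td").Count;

Console.WriteLine($"//td: {absoluteCount}, .//td: {relativeCount}");
```

If nested loops are still wanted, prefixing the inner expressions with a dot keeps each query inside the current table.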

Selecting all textnodes in table with XPath

This is a page from an open database about food:
http://www.dabas.com/ProductSheet/Details.ashx/121308
I'm trying to get some info from this page using XPath.
The table I'm interested in is the one called: Näringsvärde.
I want to get all the textnodes inside "Näringsvärde" saved into a string.
This is the relevant portion of the code linked above:
<!DOCTYPE html>
<html>
...
<body>
...
<table class="width100" style="page-break-inside: avoid">
<caption>
Produktinformation
<img src="../../images/ProductSheet/draw-triangle3.png" id="toggleProduktinformation"
class="imgCaptionOn" />
</caption>
<tbody id="tbodyProduktinformation">
<tr>
<td class="col1">
Ursprungsland:
</td>
<td>
Sverige </td>
</tr>
...
</tbody>
</table>
<table id="tableHover" class="width100 marginTop30 bgTable">
<tr class="nohover">
<td class="tdLeft48 padding0">
<table id="nutritiveTabel" class="leftTable" style="page-break-inside: avoid">
<caption>
Näringsvärde
<img src="../../images/ProductSheet/draw-triangle3.png" id="toggleNutritiveValues"
class="imgCaptionOn" />
</caption>
<tbody id="tbodyNutritiveValues">
<tr id="divNutritiveValues">
<td class="padding">
<table class="noBorder width100">
<tr>
<td class="col1">
Tillagningsstatus:
</td>
<td>Tillagad</td>
<td colspan="2">
&amp;nbsp;
</td>
</tr>
...
</table>
</td>
</tr>
</tbody>
</table>
</td>
...
</html>
I tried using something like this so far, but it didn't work:
public List<string> GetNaring(string xid) {
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(xid);
var xpath = "/html/body/div/div[2]/div[2]/table[2]/tbody/tr/td/table/tbody";
var links = doc.DocumentNode.SelectNodes(xpath);
return links.Select(n => n.InnerText).ToList();
}
But this only gives back null, what am I missing?
The XPath expression:
/html/body/div/div[2]/div[2]/table[2]/tbody/tr/td/table/tbody
does not match any nodes.
Since you have a unique string you can match, you should use it. Searching for that string in the source code, you will find:
...
<td class="tdLeft48 padding0">
<table id="nutritiveTabel" class="leftTable" style="page-break-inside: avoid">
<caption>
Näringsvärde
<img src="../../images/ProductSheet/draw-triangle3.png" id="toggleNutritiveValues"
class="imgCaptionOn" />
</caption>
<tbody id="tbodyNutritiveValues">
<tr id="divNutritiveValues">
...
The string is a child of the caption element inside the table you want. You have to get the string value of that element, trim the extra spaces and use the result to compare to "Näringsvärde". You can select the correct table using this expression:
//table[normalize-space(caption/text())='Näringsvärde']
Once you have the correct table, you can navigate inside it and select the nodes you want, or you can get the string-value which is a concatenation of all the descendant text nodes:
//table[normalize-space(caption/text())='Näringsvärde']//td
This will return all td nodes, which is where the text is.
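As a rough illustration of the caption-based selection, the sketch below runs the same expression with System.Xml against a small well-formed stand-in for the page. The stand-in markup and the variable names are assumptions; against the live page the expression would go through HtmlAgilityPack's SelectNodes instead.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Well-formed stand-in for the page: two tables distinguished only by caption text.
var doc = new XmlDocument();
doc.LoadXml(
    "<body>" +
    "<table><caption>\n  Produktinformation\n  <img/></caption>" +
    "<tr><td>Sverige</td></tr></table>" +
    "<table><caption>\n  Näringsvärde\n  <img/></caption>" +
    "<tr><td>Tillagningsstatus:</td><td>Tillagad</td></tr></table>" +
    "</body>");

// normalize-space() trims the whitespace around the caption's first text node.
XmlNodeList cells = doc.SelectNodes(
    "//table[normalize-space(caption/text())='Näringsvärde']//td");

// Collect the trimmed text of every matching cell.
var texts = new List<string>();
foreach (XmlNode cell in cells)
{
    texts.Add(cell.InnerText.Trim());
}

// Join the cell texts into the single string the question asks for.
string joined = string.Join(" ", texts);
Console.WriteLine(joined);
```

Only the cells of the table captioned Näringsvärde end up in the joined string; the Produktinformation table is skipped.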

Getting text inside table

I have a table like the one below, and I want to get just the text FOO COMPANY from between the td tags. How can I get it?
<table class="left_company">
<tr>
<td style="BORDER-RIGHT: medium none; bordercolor="#FF0000" align="left" width="291" bgcolor="#FF0000">
<table cellspacing="0" cellpadding="0" width="103%" border="0">
<tr style="CURSOR: hand" onclick="window.open('http://www.foo.com')">
<td class="title_post" title="FOO" valign="center" align="left" colspan="2">
<font style="font-weight: 700" face="Tahoma" color="#FFFFFF" size="2">***FOO COMPANY***</font>
</td>
</tr>
</table>
</td>
</tr>
<table>
I'm using following code but nS is null.
doc = hw.Load("http://www.foo.aspx?page=" + j);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//table[@class='left_company']"))
{
nS = doc.DocumentNode.SelectNodes("//td[@class='title_post']");
}
var text = doc.DocumentNode.Descendants()
.FirstOrDefault(n => n.Attributes["class"] != null &&
n.Attributes["class"].Value == "title_post")
.Element("font").InnerText;
or
var text2 = doc.DocumentNode.SelectNodes("//td[@class='title_post']/font")
.First().InnerText;
The page you are calling likely generates the content of interest using JavaScript. HtmlAgilityPack does not execute JavaScript, so that content cannot be extracted. One way to confirm this is to visit the page with scripting turned off and check whether the element you are interested in still exists.
Add an attribute to the font element, such as company="FOO", then use jQuery to get that element:
alert($('font[company="FOO"]').html())
Close: nS = doc.DocumentNode.SelectNodes("//td[@class='title_post']//text()");
You can then open the nS node to retrieve the text. If there's more than one text node, you'll need to iterate over them.

How to get a link's title and href value separately with html agility pack?

I'm trying to download a page containing a table like this:
<table id="content-table">
<tbody>
<tr>
<th id="name">Name</th>
<th id="link">link</th>
</tr>
<tr class="tt_row">
<td class="ttr_name">
<a title="name_of_the_movie" href="#"><b>name_of_the_movie</b></a>
<br>
<span class="pre">message</span>
</td>
<td class="td_dl">
<img alt="Download" src="#">
</td>
</tr>
<tr class="tt_row"> .... </tr>
<tr class="tt_row"> .... </tr>
</tbody>
</table>
I want to extract the name_of_the_movie from td class="ttr_name" and the download link from td class="td_dl".
This is the code I used to loop through the table rows:
HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
hDocument.LoadHtml(htmlSource);
HtmlNode table = hDocument.DocumentNode.SelectSingleNode("//table");
foreach (var row in table.SelectNodes("//tr"))
{
HtmlNode nameNode = row.SelectSingleNode("td[0]");
HtmlNode linkNode = row.SelectSingleNode("td[1]");
}
Currently I have no idea how to check nameNode and linkNode and extract the data inside them.
Any help would be appreciated.
I can't test it right now, but it should be something along the lines of:
string name = nameNode.Element("a").Element("b").InnerText;
string url = linkNode.Element("a").GetAttributeValue("href", "unknown");
nameNode.Attributes["title"]
linkNode.Attributes["href"]
presuming you are getting the correct Nodes.
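One reason the nodes might not be the correct ones: XPath positions are 1-based, so td[0] in the question's loop matches nothing. The sketch below shows this with System.Xml on a well-formed stand-in for a single row; the stand-in markup (including the self-closed img) and the variable names are assumptions for illustration.

```csharp
using System;
using System.Xml;

// Well-formed stand-in for one row of the question's table.
var doc = new XmlDocument();
doc.LoadXml(
    "<table><tr class=\"tt_row\">" +
    "<td class=\"ttr_name\"><a title=\"name_of_the_movie\" href=\"#\">" +
    "<b>name_of_the_movie</b></a></td>" +
    "<td class=\"td_dl\"><img alt=\"Download\" src=\"#\"/></td>" +
    "</tr></table>");

XmlNode row = doc.SelectSingleNode("//tr[@class='tt_row']");

// XPath counts from 1: td[1] is the name cell, td[2] the download cell.
XmlNode zeroCell = row.SelectSingleNode("td[0]"); // no match, so this is null
XmlNode nameCell = row.SelectSingleNode("td[1]");
XmlNode linkCell = row.SelectSingleNode("td[2]");

// The title lives on the <a> inside the name cell, not on the cell itself.
string title = nameCell.SelectSingleNode("a").Attributes["title"].Value;
string src = linkCell.SelectSingleNode("img").Attributes["src"].Value;

Console.WriteLine($"td[0] found: {zeroCell != null}, title: {title}, src: {src}");
```

In the real loop the header row has th cells only, so a null check on each cell is still needed before reading attributes.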
public const string UrlExtractor = @"(?: href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";
public static Match GetMatchRegEx(string text)
{
return new Regex(UrlExtractor, RegexOptions.IgnoreCase).Match(text);
}
Here is how you can extract all href URLs. I use this regex in one of my projects; you can modify it to suit your needs and extend it to match the title as well. I find it more convenient to match them in bulk.
