How to select nodes by an attribute that starts with... in C#

I have this XML document and I want to select nodes whose attribute value starts with '/employees/'.
<table>
<tr>
<td>
Employee 1
</td>
<td>Robert</td>
</tr>
<tr>
<td>
Employee 2
</td>
<td>Jennifer</td>
</tr>
</table>
So in C#, I would do something like this:
parentNode.SelectNodes("//table/tr/th/a[@href='/employees/.....']")
Is this possible with C#?
Thanks!

The simple starts-with function does what you need:
parentNode.SelectNodes("//table/tr/td/a[starts-with(@href, '/employees/')]")

Using pure LINQ to XML, you can do something like this:
var doc = XDocument.Parse("YOUR_XML_STRING");
// The (string) cast yields null instead of throwing when an <a> element has no href attribute.
var anchors = from e in doc.Descendants("a") where ((string)e.Attribute("href") ?? "").StartsWith("/employees/") select e;
// From each anchor you can then reach other nodes by chaining .Parent (e.g. .Parent.Parent).

So, something like this?
var xml = @"<table>
<tr>
<td>
Employee 1
</td>
<td>Robert</td>
</tr>
<tr>
<td>
Employee 2
</td>
<td>Jennifer</td>
</tr>
</table>";
var doc = new XmlDocument();
doc.LoadXml(xml);
var employees = doc.SelectNodes("/table/tr/td/a[starts-with(@href, '/employees/')]");
DoWhatever(employees);

Sure, you can load your XML into an XDocument instance and use the XPathSelectElements method to search using your expression.
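A minimal sketch of the XDocument approach, using only the standard System.Xml.Linq and System.Xml.XPath namespaces. The sample XML, including its `<a href="...">` elements, and the names EmployeeLinks and FindByHrefPrefix are assumptions made for illustration; the markup in the question does not show the anchors.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath; // brings the XPathSelectElements extension method into scope

public static class EmployeeLinks
{
    // Hypothetical sample data: the anchors and their href values are assumed.
    public const string SampleXml = @"<table>
  <tr><td><a href='/employees/1'>Employee 1</a></td><td>Robert</td></tr>
  <tr><td><a href='/employees/2'>Employee 2</a></td><td>Jennifer</td></tr>
  <tr><td><a href='/contractors/3'>Contractor 3</a></td><td>Alex</td></tr>
</table>";

    // Returns the text of every <a> element whose href starts with the given prefix.
    public static List<string> FindByHrefPrefix(string xml, string prefix)
    {
        XDocument doc = XDocument.Parse(xml);
        // The prefix is placed inside a single-quoted XPath literal,
        // so this sketch assumes it contains no single quote.
        string xpath = "//table/tr/td/a[starts-with(@href, '" + prefix + "')]";
        return doc.XPathSelectElements(xpath).Select(a => a.Value).ToList();
    }
}
```

Calling EmployeeLinks.FindByHrefPrefix(EmployeeLinks.SampleXml, "/employees/") on this sample returns the two employee anchors and skips the contractor one.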

Related

How can I get the following sibling using HtmlAgilityPack?

I have many tr tags in my HTML code:
<div class="noticeTabBoxWrapper">
<tr>
<td>
<span>Text for anchor</span>
</td>
</tr>
<tr>
<td>
<span>*constantly changing text*</span>
</td>
</tr>
In my code I wrote this:
//div[@class = 'noticeTabBoxWrapper']//span[contains(text(), 'Text for anchor')]").InnerText
How can I rewrite the code so that it extracts the text that immediately follows the anchor?
Thanks.
Assuming raw is your sample data:
var doc = new HtmlDocument();
doc.LoadHtml(raw);
var xpath = "//div[@class='noticeTabBoxWrapper']//span[contains(., 'Text for anchor')]/following::td[1]/span";
var result = doc.DocumentNode.SelectSingleNode(xpath);
Console.WriteLine(result.InnerText);
Output: *constantly changing text*

C# - HTML "nested table" in class

WebClient client = new WebClient();
var data = client.DownloadString("a web link");
and I am getting an HTML page that contains a table like this:
<table>
<tr>
<td> Team 1 ID </td>
<td> Team 1 Name </td>
<td>
<table>
<tr>
<td> Member 1 name </td>
<td> Member 1 age </td>
</tr>
<tr>
<td> Member 2 name </td>
<td> Member 2 age </td>
</tr>
</table>
</td>
</tr>
<tr>
<td> Team 2 ID </td>
<td> Team 2 Name </td>
<td>
<table>
<tr>
<td> Member 1 name </td>
<td> Member 1 age </td>
</tr>
</table>
</td>
</tr>
That is, each row of the main table contains another table, which is why I call it a nested table.
Now I want to load this data into a class like this:
class Team
{
public int teamID;
public string teamName;
public struct Member
{
public string memberName;
public int memberAge;
}
public Member member1;
public Member member2;
}
Note that each team might have 0 to 3 members,
so I am looking for a sound solution to this problem.
Should I use RegEx, HtmlAgilityPack, or something else, and how?
Thanks in advance.
Just use HtmlAgilityPack. If you run into any trouble, I can help you.
Regular expressions can only match regular languages but HTML is a
context-free language. The only thing you can do with regexps on HTML
is heuristics but that will not work on every condition. It should be
possible to present a HTML file that will be matched wrongly by any
regular expression.
Using regular expressions to parse HTML: why not?
It will be easier if your HTML contains identifiers (CSS classes or ids).
Updated code: here is my suggested approach to your problem:
string mainURL = "your url";
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(mainURL);
// All tables that contain another table somewhere inside them
var tables = doc.DocumentNode.Descendants("table").Where(t => t.Descendants("table").Any());
foreach (var table in tables)
{
// Direct tr children only (not grandchildren)
var rows = table.ChildNodes.Where(n => n.Name.Equals("tr"));
foreach (var row in rows)
{
// ChildNodes also contains whitespace text nodes, so keep only the td elements
var cells = row.ChildNodes.Where(n => n.Name.Equals("td")).ToList();
for (int i = 0; i < cells.Count; i++)
{
// The nested table sits inside a td, not directly under the tr
var nestedTable = cells[i].Element("table");
if (nestedTable != null)
{
// Handle the nested table (the team members) here
}
else
{
// Put your logic here, e.g. for i == 0 assign the text to the TeamID property, etc.
}
}
}
}

Selecting all textnodes in table with XPath

This is a page from an open database about food:
http://www.dabas.com/ProductSheet/Details.ashx/121308
I'm trying to get some information from this page using XPath.
The table I'm interested in is the one called: Näringsvärde.
I want to get all the text nodes inside "Näringsvärde" saved into a string.
This is the relevant portion of the page linked above:
<!DOCTYPE html>
<html>
...
<body>
...
<table class="width100" style="page-break-inside: avoid">
<caption>
Produktinformation
<img src="../../images/ProductSheet/draw-triangle3.png" id="toggleProduktinformation"
class="imgCaptionOn" />
</caption>
<tbody id="tbodyProduktinformation">
<tr>
<td class="col1">
Ursprungsland:
</td>
<td>
Sverige </td>
</tr>
...
</tbody>
</table>
<table id="tableHover" class="width100 marginTop30 bgTable">
<tr class="nohover">
<td class="tdLeft48 padding0">
<table id="nutritiveTabel" class="leftTable" style="page-break-inside: avoid">
<caption>
Näringsvärde
<img src="../../images/ProductSheet/draw-triangle3.png" id="toggleNutritiveValues"
class="imgCaptionOn" />
</caption>
<tbody id="tbodyNutritiveValues">
<tr id="divNutritiveValues">
<td class="padding">
<table class="noBorder width100">
<tr>
<td class="col1">
Tillagningsstatus:
</td>
<td>Tillagad</td>
<td colspan="2">
&nbsp;
</td>
</tr>
...
</table>
</td>
</tr>
</tbody>
</table>
</td>
...
</html>
I tried using something like this so far, but it didn't work:
public List<string> GetNaring(string xid) {
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(xid);
var xpath = "/html/body/div/div[2]/div[2]/table[2]/tbody/tr/td/table/tbody";
var links = doc.DocumentNode.SelectNodes(xpath);
return links.Select(n => n.InnerText).ToList();
}
But this only returns null. What am I missing?
The XPath expression:
/html/body/div/div[2]/div[2]/table[2]/tbody/tr/td/table/tbody
does not match any nodes.
Since you have a unique string you can match, you should use it. Searching for that string in the source code, you will find:
...
<td class="tdLeft48 padding0">
<table id="nutritiveTabel" class="leftTable" style="page-break-inside: avoid">
<caption>
Näringsvärde
<img src="../../images/ProductSheet/draw-triangle3.png" id="toggleNutritiveValues"
class="imgCaptionOn" />
</caption>
<tbody id="tbodyNutritiveValues">
<tr id="divNutritiveValues">
...
The string is a child of the caption element inside the table you want. You have to get the string value of that element, trim the extra spaces and use the result to compare to "Näringsvärde". You can select the correct table using this expression:
//table[normalize-space(caption/text())='Näringsvärde']
Once you have the correct table, you can navigate inside it and select the nodes you want, or you can get the string-value which is a concatenation of all the descendant text nodes:
//table[normalize-space(caption/text())='Näringsvärde']//td
This will return all td nodes, which is where the text is.
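The caption-matching expression can be tried in isolation. The sketch below runs it with the built-in XmlDocument class against a small, well-formed stand-in for the page; the sample markup and the names CaptionLookup and CellTexts are assumptions for illustration. With HtmlAgilityPack the same XPath string would be passed to doc.DocumentNode.SelectNodes instead.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class CaptionLookup
{
    // Simplified, well-formed stand-in for the page (assumed, not the real markup).
    public const string SampleXml = @"<body>
  <table>
    <caption>
      Produktinformation
    </caption>
    <tr><td>Ursprungsland:</td><td>Sverige</td></tr>
  </table>
  <table>
    <caption>
      Näringsvärde
    </caption>
    <tr><td>Tillagningsstatus:</td><td>Tillagad</td></tr>
  </table>
</body>";

    // Returns the trimmed text of every td inside the table whose caption,
    // after whitespace normalization, equals the given string.
    public static List<string> CellTexts(string xml, string caption)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        // Assumes the caption contains no single quote, since it is
        // embedded in a single-quoted XPath literal.
        string xpath = "//table[normalize-space(caption/text())='" + caption + "']//td";
        var result = new List<string>();
        foreach (XmlNode td in doc.SelectNodes(xpath))
            result.Add(td.InnerText.Trim());
        return result;
    }
}
```

On the sample, CaptionLookup.CellTexts(CaptionLookup.SampleXml, "Näringsvärde") returns "Tillagningsstatus:" and "Tillagad"; joining the list with string.Join gives the single string the question asks for.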

How to get next 2 nodes in HTML + HTMLAgilitypack

I have a table in the HTML code below:
<table style="padding: 0px; border-collapse: collapse;">
<tr>
<td><h3>My Regional Financial Office</h3></td>
</tr>
<tr>
<td> </td>
</tr>
<tr>
<td><h3>My Address</h3></td>
</tr>
<tr>
<td>000 Test Ave S Ste 000</td>
</tr>
<tr>
<td>Golden Valley, MN 00000</td>
</tr>
<tr>
<td>Get Directions</td>
</tr>
<tr>
<td> </td>
</tr>
</table>
How can I get the inner text of the next 2 <tr> tags after the table row containing the text "My Address"?
You can use the following XPath:
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var tdOfInterests =
htmlDoc.DocumentNode
.SelectNodes("//tr[td/h3[.='My Address']]/following-sibling::tr[position() <= 2]/td");
foreach (HtmlNode td in tdOfInterests)
{
// Given the HTML in the question, this prints the following 2 lines:
//000 Test Ave S Ste 000
//Golden Valley, MN 00000
Console.WriteLine(td.InnerText);
}
The key to the XPath above is combining the following-sibling axis with a position() filter.
UPDATE:
A brief explanation of the XPath used in this answer:
//tr[td/h3[.='My Address']]
The part above selects the <tr> element that has a child <td> element, which in turn has a child <h3> element whose value equals 'My Address'.
/following-sibling::tr[position() <= 2]
The next part selects the following sibling <tr> elements at position <= 2 relative to the current <tr> element (the one selected by the previous part).
/td
The last part selects the child <td> element of each of those <tr> elements.
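The question's table happens to be well-formed XML, so the three XPath parts can be checked with the built-in XmlDocument class instead of HtmlAgilityPack. That swap, and the names AddressRows and RowsAfterHeading, are choices made here for a dependency-free sketch; the table is condensed from the question.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class AddressRows
{
    // The table from the question, condensed to one row per line.
    public const string SampleXml = @"<table>
  <tr><td><h3>My Regional Financial Office</h3></td></tr>
  <tr><td> </td></tr>
  <tr><td><h3>My Address</h3></td></tr>
  <tr><td>000 Test Ave S Ste 000</td></tr>
  <tr><td>Golden Valley, MN 00000</td></tr>
  <tr><td>Get Directions</td></tr>
  <tr><td> </td></tr>
</table>";

    // Returns the text of the td cells in the first `count` rows
    // that follow the row whose h3 equals `heading`.
    public static List<string> RowsAfterHeading(string xml, string heading, int count)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        // Assumes the heading contains no single quote (it sits in an XPath literal).
        string xpath = "//tr[td/h3[.='" + heading + "']]"
                     + "/following-sibling::tr[position() <= " + count + "]/td";
        var result = new List<string>();
        foreach (XmlNode td in doc.SelectNodes(xpath))
            result.Add(td.InnerText);
        return result;
    }
}
```

AddressRows.RowsAfterHeading(AddressRows.SampleXml, "My Address", 2) returns "000 Test Ave S Ste 000" and "Golden Valley, MN 00000", the same two lines the answer's loop prints.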

How to get a link's title and href value separately with html agility pack?

I'm trying to download a page that contains a table like this:
<table id="content-table">
<tbody>
<tr>
<th id="name">Name</th>
<th id="link">link</th>
</tr>
<tr class="tt_row">
<td class="ttr_name">
<a title="name_of_the_movie" href="#"><b>name_of_the_movie</b></a>
<br>
<span class="pre">message</span>
</td>
<td class="td_dl">
<img alt="Download" src="#">
</td>
</tr>
<tr class="tt_row"> .... </tr>
<tr class="tt_row"> .... </tr>
</tbody>
</table>
I want to extract the name_of_the_movie from td class="ttr_name" and the download link from td class="td_dl".
This is the code I used to loop through the table rows:
HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
hDocument.LoadHtml(htmlSource);
HtmlNode table = hDocument.DocumentNode.SelectSingleNode("//table");
foreach (var row in table.SelectNodes("//tr"))
{
HtmlNode nameNode = row.SelectSingleNode("td[0]");
HtmlNode linkNode = row.SelectSingleNode("td[1]");
}
Currently I have no idea how to inspect nameNode and linkNode and extract the data inside them.
Any help would be appreciated.
Regards
I can't test it right now, but it should be something along the lines of:
string name = nameNode.Element("a").Element("b").InnerText;
string url = linkNode.Element("a").GetAttributeValue("href", "unknown");
nameNode.Attributes["title"]
linkNode.Attributes["href"]
presuming you are getting the correct Nodes.
public const string UrlExtractor = @"(?: href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";
public static Match GetMatchRegEx(string text)
{
return new Regex(UrlExtractor, RegexOptions.IgnoreCase).Match(text);
}
Here is how you can extract all href URLs. I'm using that regex in one of my projects; you can modify it to match your needs and extend it to match the title as well. I guess it is more convenient to match them in bulk.
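A sketch of the "in bulk" idea, assuming the pattern above is used unchanged with Regex.Matches so that every href in the text is collected in one pass. The names HrefExtractor and ExtractAll are hypothetical. Note that the pattern's negative lookahead skips an href when "css" or "this." appears anywhere later on the same line, so treat the result as a heuristic rather than a parser.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Pattern copied from the answer above.
    public const string UrlExtractor = @"(?: href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";

    // Collects the named group "url" from every match in the input.
    public static List<string> ExtractAll(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, UrlExtractor, RegexOptions.IgnoreCase))
            urls.Add(m.Groups["url"].Value);
        return urls;
    }
}
```

For the input `<a title="first" href="/movies/1"><b>first</b></a><a title="second" href="/movies/2">second</a>`, ExtractAll returns "/movies/1" and "/movies/2".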
