C# - parse an HTML "nested table" into a class

WebClient client = new WebClient();
var data = client.DownloadString("a web link");
This gives me an HTML page that contains a table like this:
<table>
<tr>
<td> Team 1 ID </td>
<td> Team 1 Name </td>
<td>
<table>
<tr>
<td> Member 1 name </td>
<td> Member 1 age </td>
</tr>
<tr>
<td> Member 2 name </td>
<td> Member 2 age </td>
</tr>
</table>
</td>
</tr>
<tr>
<td> Team 2 ID </td>
<td> Team 2 Name </td>
<td>
<table>
<tr>
<td> Member 1 name </td>
<td> Member 1 age </td>
</tr>
</table>
</td>
</tr>
</table>
So there is another table in each row of the main table, which is why I call it a nested table.
Now I want to load this data into a class like this:
class Team
{
public int teamID;
public string teamName;
public struct Member
{
public string memberName;
public int memberAge;
}
public Member member1;
public Member member2;
}
Note that each team might have 0 to 3 members, so I am looking for a sound solution to this problem.
Should I use regular expressions, HtmlAgilityPack, or something else, and how?
Thanks in advance.

Just use HtmlAgilityPack. If you run into any troubles, I can help you.
Regular expressions can only match regular languages, but HTML is a
context-free language. The only thing you can do with regexps on HTML
is apply heuristics, and those will not work in every case. For any
regular expression, it should be possible to construct an HTML file
that it matches wrongly.
Using regular expressions to parse HTML: why not?
It will be easier if your HTML contains any identifiers (CSS classes or ids).
Updated code: here is my suggested approach to your problem:
string mainURL = "your url";
HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(mainURL);
// all tables that contain another table, i.e. the outer team tables
var tables = doc.DocumentNode.Descendants("table").Where(t => t.Descendants("table").Any());
foreach (var table in tables)
{
    // direct <tr> children only, not the rows of the nested tables
    var rows = table.ChildNodes.Where(n => n.Name.Equals("tr"));
    foreach (var row in rows)
    {
        // direct <td> children of the row
        var cells = row.ChildNodes.Where(n => n.Name.Equals("td")).ToList();
        for (int i = 0; i < cells.Count; i++)
        {
            // the nested table is a child of a <td>, not of the <tr>
            var nestedTable = cells[i].ChildNodes.FirstOrDefault(n => n.Name.Equals("table"));
            if (nestedTable != null)
            {
                // handle the nested member table here
            }
            else
            {
                // put your logic here, e.g. when i == 0 assign cells[i].InnerText to the team ID
            }
        }
    }
}
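To make the two placeholder comments concrete, here is a minimal sketch of the whole mapping. It is only an illustration: the names `TeamParser`, `Team.Members` and `Rows` are made up for this example, it stores members in a list instead of fixed `member1`/`member2` fields (because a team can have 0 to 3 members), and it assumes the real ID and age cells contain plain integers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be installed

public class Member
{
    public string Name;
    public int Age;
}

public class Team
{
    public int Id;
    public string Name;
    // A list instead of member1/member2 so that 0..3 members fit (assumption of this sketch).
    public List<Member> Members = new List<Member>();
}

public static class TeamParser
{
    // Direct rows of a table, whether or not the markup contains a <tbody>.
    private static IEnumerable<HtmlNode> Rows(HtmlNode table) =>
        table.SelectNodes("./tr|./tbody/tr") ?? Enumerable.Empty<HtmlNode>();

    public static List<Team> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var teams = new List<Team>();

        // Outer table: the first <table> that is not inside another <table>.
        var outer = doc.DocumentNode.Descendants("table")
            .FirstOrDefault(t => !t.Ancestors("table").Any());
        if (outer == null) return teams;

        foreach (var row in Rows(outer))
        {
            var cells = row.Elements("td").ToList(); // direct cells only
            if (cells.Count < 2) continue;

            int.TryParse(cells[0].InnerText.Trim(), out var id); // stays 0 if not numeric
            var team = new Team { Id = id, Name = cells[1].InnerText.Trim() };

            // The member table, if any, sits inside the third cell.
            var nested = cells.Count > 2 ? cells[2].Element("table") : null;
            if (nested != null)
            {
                foreach (var memberRow in Rows(nested))
                {
                    var mc = memberRow.Elements("td").ToList();
                    if (mc.Count < 2) continue;
                    int.TryParse(mc[1].InnerText.Trim(), out var age);
                    team.Members.Add(new Member { Name = mc[0].InnerText.Trim(), Age = age });
                }
            }
            teams.Add(team);
        }
        return teams;
    }
}
```

You would call it as `var teams = TeamParser.Parse(data);` with the string returned by `DownloadString`.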

Related

How do I select certain 'Nodes' in a text file - based off of what a certain line contains using HTMLAgilityPack?

I know the title says a lot; don't worry, I'll break it down.
I have one .txt file, TeamName.txt, that contains the word Horsemen.
I have another 6 .txt files containing HTML that my code fetches and downloads, such as Ladder-1-100.txt. Now the easy part:
Here's the idea: the code searches the HTML in the ladder file for the team name, which already works. But I also want it to pull out the other information from the same table row. Not specific enough? I'll show you.
<tr class="vrml_table_row">
<td class="pos_cell">59</td>
<td class="div_cell"><img src="/images/div_gold_40.png" title="Gold" /></td>
<td class="team_cell"><img src="/images/logos/teams/9b3b1917-a56b-40a3-80ee-52b1c9f31910.png" class="team_logo" /><span class="team_name">Echoholics</span></td>
<td class="group_cell"><img src="/images/group_ame.png" class="group_logo" title="America East" /></td>
<td class="gp_cell">14</td>
<td class="win_cell">10</td>
<td class="loss_cell">4</td>
<td class="pts_cell">340</td>
<td class="mmr_cell"><span>1200</span></td>
</tr>
<tr class="vrml_table_row">
<td class="pos_cell">60</td>
<td class="div_cell"><img src="/images/div_diamond_40.png" title="Diamond" /></td>
<td class="team_cell"><img src="/images/logos/teams/dff8310a-a429-4c60-af82-0333d530d22d.png" class="team_logo" /><span class="team_name">Horsemen</span></td>
<td class="group_cell"><img src="/images/group_aa.png" class="group_logo" title="Oceania/Asia" /></td>
<td class="gp_cell">10</td>
<td class="win_cell">6</td>
<td class="loss_cell">4</td>
<td class="pts_cell">235</td>
<td class="mmr_cell"><span>1200</span></td>
</tr>
<tr class="vrml_table_row">
<td class="pos_cell">61</td>
<td class="div_cell"><img src="/images/div_gold_40.png" title="Gold" /></td>
<td class="team_cell"><img src="/images/logos/teams/8eb6109e-f765-4d64-a766-cc5605a01ad0.png" class="team_logo" /><span class="team_name">Femboys</span></td>
<td class="group_cell"><img src="/images/group_ame.png" class="group_logo" title="America East" /></td>
<td class="gp_cell">12</td>
<td class="win_cell">8</td>
<td class="loss_cell">4</td>
<td class="pts_cell">348</td>
<td class="mmr_cell"><span>1200</span></td>
</tr>
Here is my current code that will spit out: Team Name: Horsemen.
HtmlNode[] team_name = document1.DocumentNode
.SelectSingleNode("//*[@class='vrml_table_row']")
.SelectNodes("//td[@class='team_cell']")
.Where(x => x.InnerHtml.Contains($"{TeamName}"))
.ToArray();
foreach (HtmlNode item in team_name)
{
await ReplyAsync("**Team Name:** " + item.InnerHtml);
}
However, I want it to spit out:
Team Name: Horsemen, Wins: 6, Losses: 4, Games Played: 10, MMR: 1200, Points Scored: 235, Division: Diamond, Ladder Position: 60.
You get my point. As you can see, the classes are the same in every row; only the information inside differs. By the way, the team name (Horsemen) is dynamic, meaning it can be replaced with another team name. So how do I achieve this?
A sample solution would be this one.
First, create a model class:
class Model
{
    public int Position { get; set; }
    public string TeamName { get; set; }
    public string ImageSource { get; set; }
    public string Division { get; set; }
    public string TeamLogo { get; set; }
    public string GroupLogo { get; set; }
    public string GroupTitle { get; set; }
    // whatever else you want to store
}
Then keep the desired nodes in an HtmlNodeCollection and the models in a List:
var table = htmlDoc.DocumentNode.SelectNodes("//tr[contains(@class, 'vrml_table_row')]");
var models = new List<Model>();
foreach (var t in table)
{
    var model = new Model
    {
        // I used the first columns of the desired table. The paths follow the sample rows,
        // where team_cell holds the <img> and <span> directly; if the live page wraps them
        // in an <a>, add /a to those paths.
        Position = int.Parse(t.SelectSingleNode("td[contains(@class, 'pos_cell')]").InnerText),
        ImageSource = t.SelectSingleNode("td[contains(@class, 'div_cell')]/img").Attributes["src"].Value,
        Division = t.SelectSingleNode("td[contains(@class, 'div_cell')]/img").Attributes["title"].Value,
        TeamLogo = t.SelectSingleNode("td[contains(@class, 'team_cell')]/img").Attributes["src"].Value,
        TeamName = t.SelectSingleNode("td/span[contains(@class, 'team_name')]").InnerText,
        GroupLogo = t.SelectSingleNode("td[contains(@class, 'group_cell')]/img").Attributes["src"].Value,
        GroupTitle = t.SelectSingleNode("td[contains(@class, 'group_cell')]/img").Attributes["title"].Value
        // etc
    };
    models.Add(model);
}
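If you only need the one row for a given team, a smaller sketch along these lines may be enough. `LadderLookup`, `Describe` and the local `Cell` helper are names invented for this example, and the class names are taken from the sample rows in the question; it compares the team name exactly instead of using `Contains`, so "Horsemen" does not also match a longer name that contains it.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package, assumed to be installed

public static class LadderLookup
{
    // Returns a one-line summary for the named team, or null when no row matches.
    public static string Describe(string html, string teamName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var rows = doc.DocumentNode.SelectNodes("//tr[contains(@class, 'vrml_table_row')]");
        var row = rows?.FirstOrDefault(r =>
            r.SelectSingleNode(".//span[contains(@class, 'team_name')]")?.InnerText.Trim() == teamName);
        if (row == null) return null;

        // Trimmed text of the cell with the given class, or "" if the cell is missing.
        string Cell(string cls) =>
            row.SelectSingleNode($"td[contains(@class, '{cls}')]")?.InnerText.Trim() ?? "";

        // The division is not text; it is the title attribute of the image in div_cell.
        var division = row.SelectSingleNode("td[contains(@class, 'div_cell')]/img")
            ?.GetAttributeValue("title", "") ?? "";

        return $"Team Name: {teamName}, Wins: {Cell("win_cell")}, Losses: {Cell("loss_cell")}, " +
               $"Games Played: {Cell("gp_cell")}, MMR: {Cell("mmr_cell")}, " +
               $"Points Scored: {Cell("pts_cell")}, Division: {division}, " +
               $"Ladder Position: {Cell("pos_cell")}.";
    }
}
```

The relative paths (`td[...]`, `.//span[...]`) are evaluated against the matched row, which is what keeps the values tied to the right team; a path starting with `//` would search the whole document again.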

How to find last column of a table using Html Agility Pack

I have a table like this:
<table border="0" cellpadding="0" cellspacing="0" id="table2">
<tr>
<th>Name
</th>
<th>Age
</th>
</tr>
<tr>
<td>Mario
</td>
<td>Age: 78
</td>
</tr>
<tr>
<td>Jane
</td>
<td>Age: 67
</td>
</tr>
<tr>
<td>James
</td>
<td>Age: 92
</td>
</tr>
</table>
I want to get the last td from all rows using Html Agility Pack.
Here is my C# code so far:
await page.GoToAsync(NumOfSaleItems, new NavigationOptions
{
WaitUntil = new WaitUntilNavigation[] { WaitUntilNavigation.DOMContentLoaded }
});
var html4 = page.GetContentAsync().GetAwaiter().GetResult();
var htmlDoc4 = new HtmlDocument();
htmlDoc4.LoadHtml(html4);
var SelectTable = htmlDoc4.DocumentNode.SelectNodes("/html/body/div[2]/div/div/div/table[2]/tbody/tr/td[1]/div[3]/div[2]/div/table[2]/tbody/tr/td[4]");
if (SelectTable.Count == 0)
{
continue;
}
else
{
foreach (HtmlNode row in SelectTable)//
{
string value = row.InnerText;
value = value.ToString();
var firstSpaceIndex = value.IndexOf(" ");
var firstString = value.Substring(0, firstSpaceIndex);
LastSellingDates.Add(firstString);
}
}
How can I get only the last column of the table?
I think the XPath you want is: //table[@id='table2']//tr/td[last()].
//table[@id='table2'] finds the table by ID anywhere in the document. This is preferable to a long brittle path from the root, since a table ID is less likely to change than the rest of the HTML structure.
//tr gets the descendant rows in the table. I'm using two slashes in case there might be an intervening <tbody> element in the actual HTML.
/td[last()] gets the last <td> in each row.
From there you just need to select the InnerText of each <td>.
var tds = htmlDoc.DocumentNode.SelectNodes("//table[@id='table2']//tr/td[last()]");
var values = tds?.Select(td => td.InnerText).ToList() ?? new List<string>();
Working demo here: https://dotnetfiddle.net/7I8yk1

C#, convert columns to one row in a list

Ok, here is the situation: I've got a C# partial class that returns a list of objects, using the .Select().ToList() functionality.
I get how this works.
The issue that I'm running into, though, is that there are four properties in the class that I need to use in one column. To compound this, each column is supposed to be tied to a row, which displays a lot of redundant information.
An example helps:
I currently have this, let's say, as my class:
public class MyClass{
public string PersonName {get;set;}
public string PersonAddress {get;set;}
public string Field1 {get;set;}
public string Field2 {get;set;}
public string Field3 {get;set;}
public string Field4 {get;set;}
}
That gives you a basic idea of my class. My original suggestion was to set the code up so that it produced this:
<table>
<tr>
<td>Name</td>
<td>Address</td>
<td>Field 1<br>Field 2<br>Field 3<br>Field 4</td>
</tr>
</table>
Which makes sense.
However, those above me have decided that they want a layout like this:
<table>
<tr>
<td>Name</td>
<td>Address</td>
<td>Field 1</td>
</tr>
<tr>
<td>Name</td>
<td>Address</td>
<td>Field 2</td>
</tr>
<tr>
<td>Name</td>
<td>Address</td>
<td>Field 3</td>
</tr>
<tr>
<td>Name</td>
<td>Address</td>
<td>Field 4</td>
</tr>
</table>
Where, if you'll notice, Name and Address are redundant for each row.
So, how could I do this in C#?
You can use the LINQ SelectMany method to generate a list that contains four objects for each MyClass object, like this:
List<MyClass> list = ...
var result = list.SelectMany(x =>
new[]
{
new {x.PersonAddress, x.PersonName, Value = x.Field1},
new {x.PersonAddress, x.PersonName, Value = x.Field2},
new {x.PersonAddress, x.PersonName, Value = x.Field3},
new {x.PersonAddress, x.PersonName, Value = x.Field4}
}).ToList();
This will return a list of anonymous type objects.
Each one of these objects will contain a PersonAddress, a PersonName, and a Value property.
You can then loop over them to generate the HTML that you want.
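As a rough illustration of that last step, the sketch below flattens the objects with SelectMany and writes one table row per field. `RowRenderer` and `ToHtmlTable` are names invented for this example, and building the markup with a StringBuilder is just one option; in a web project you would more likely loop over the flattened list in your view.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;

public class MyClass
{
    public string PersonName { get; set; }
    public string PersonAddress { get; set; }
    public string Field1 { get; set; }
    public string Field2 { get; set; }
    public string Field3 { get; set; }
    public string Field4 { get; set; }
}

public static class RowRenderer
{
    public static string ToHtmlTable(IEnumerable<MyClass> people)
    {
        // One anonymous object per (person, field) pair: four per MyClass instance.
        var rows = people.SelectMany(p =>
            new[] { p.Field1, p.Field2, p.Field3, p.Field4 }
                .Select(f => new { p.PersonName, p.PersonAddress, Value = f }));

        var sb = new StringBuilder("<table>\n");
        foreach (var r in rows)
        {
            // HtmlEncode keeps characters such as < and & in the data from breaking the markup.
            sb.Append("<tr>")
              .Append("<td>").Append(WebUtility.HtmlEncode(r.PersonName)).Append("</td>")
              .Append("<td>").Append(WebUtility.HtmlEncode(r.PersonAddress)).Append("</td>")
              .Append("<td>").Append(WebUtility.HtmlEncode(r.Value)).Append("</td>")
              .Append("</tr>\n");
        }
        return sb.Append("</table>").ToString();
    }
}
```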

How to select nodes by attribute that starts with... in C#

I have this xml document and I want to select nodes by attribute that starts with '/employees/'.
<table>
<tr>
<td>
Employee 1
</td>
<td>Robert</td>
</tr>
<tr>
<td>
Employee 2
</td>
<td>Jennifer</td>
</tr>
</table>
So in C#, I would do something like this:
parentNode.SelectNodes("//table/tr/th/a[@href='/employees/.....']")
Is this possible with C#?
Thanks!
The simple starts-with function does what you need:
parentNode.SelectNodes("//table/tr/td/a[starts-with(@href, '/employees/')]")
Using pure LINQ, you can do something like this:
var doc = XDocument.Parse("YOUR_XML_STRING");
var anchors = from e in doc.Descendants("a") where ((string)e.Attribute("href"))?.StartsWith("/employees/") == true select e;
// now you can select any related node by chaining .Parent.Parent...
So, something like this?
var xml = @"<table>
<tr>
<td>
Employee 1
</td>
<td>Robert</td>
</tr>
<tr>
<td>
Employee 2
</td>
<td>Jennifer</td>
</tr>
</table>";
var doc = new XmlDocument();
doc.LoadXml(xml);
var employees = doc.SelectNodes("/table/tr/td/a[starts-with(@href, '/employees/')]");
DoWhatever(employees);
Sure, you can load your XML into the XDocument instance and use XPathSelectElements method to search using your expression.
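A minimal sketch of that XDocument route follows. `EmployeeLinks` and `Find` are names made up for this example; XPathSelectElements is an extension method that only becomes available once the System.Xml.XPath namespace is imported.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath; // provides the XPathSelectElements extension method

public static class EmployeeLinks
{
    // Returns every <a> in a table cell whose href starts with "/employees/".
    public static List<XElement> Find(string xml)
    {
        var doc = XDocument.Parse(xml);
        return doc
            .XPathSelectElements("//table/tr/td/a[starts-with(@href, '/employees/')]")
            .ToList();
    }
}
```

Each returned XElement is the anchor itself, so `link.Value` gives its text and `(string)link.Attribute("href")` gives the full link.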

How to get a link's title and href value separately with html agility pack?

I'm trying to download a page that contains a table like this:
<table id="content-table">
<tbody>
<tr>
<th id="name">Name</th>
<th id="link">link</th>
</tr>
<tr class="tt_row">
<td class="ttr_name">
<a title="name_of_the_movie" href="#"><b>name_of_the_movie</b></a>
<br>
<span class="pre">message</span>
</td>
<td class="td_dl">
<img alt="Download" src="#">
</td>
</tr>
<tr class="tt_row"> .... </tr>
<tr class="tt_row"> .... </tr>
</tbody>
</table>
I want to extract name_of_the_movie from the td with class="ttr_name" and the download link from the td with class="td_dl".
This is the code I use to loop through the table rows:
HtmlAgilityPack.HtmlDocument hDocument = new HtmlAgilityPack.HtmlDocument();
hDocument.LoadHtml(htmlSource);
HtmlNode table = hDocument.DocumentNode.SelectSingleNode("//table");
foreach (var row in table.SelectNodes("//tr"))
{
HtmlNode nameNode = row.SelectSingleNode("td[0]");
HtmlNode linkNode = row.SelectSingleNode("td[1]");
}
Currently I have no idea how to check nameNode and linkNode and extract the data inside them.
Any help would be appreciated.
Regards
I can't test it right now, but it should be something along the lines of:
string name = nameNode.Element("a").Element("b").InnerText;
string url = linkNode.Element("a").GetAttributeValue("href", "unknown");
nameNode.Attributes["title"]
linkNode.Attributes["href"]
presuming you are getting the correct Nodes.
public const string UrlExtractor = @"(?: href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript|.*css|.*this\.)(?<url>.*?)(?:[\s>""'])";
public static Match GetMatchRegEx(string text)
{
return new Regex(UrlExtractor, RegexOptions.IgnoreCase).Match(text);
}
Here is how you can extract all href URLs. I'm using this regex in one of my projects; you can modify it to match your needs and rewrite it to match the title as well. I guess it is more convenient to match them in bulk.
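For comparison, here is a sketch that stays with HtmlAgilityPack and reads both values per row. `MovieRows` and `Extract` are names invented for this example. Two assumptions: XPath positions are 1-based, so the first cell is td[1] rather than td[0], and the download cell is assumed to contain an `<a>` around the image on the real page (the sample markup shows only the `<img>`, in which case the href comes back empty).

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package, assumed to be installed

public static class MovieRows
{
    public static List<(string Title, string Href)> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<(string Title, string Href)>();

        // Only the data rows; the header row has no tt_row class.
        var rows = doc.DocumentNode.SelectNodes(
            "//table[@id='content-table']//tr[@class='tt_row']");
        if (rows == null) return result; // SelectNodes returns null when nothing matches

        foreach (var row in rows)
        {
            // Relative paths: evaluated against this row, not the whole document.
            var nameLink = row.SelectSingleNode("td[1]/a");
            var downloadLink = row.SelectSingleNode("td[2]//a"); // assumed to exist on the real page
            if (nameLink == null) continue; // skip rows without a name link

            result.Add((
                nameLink.GetAttributeValue("title", ""),
                downloadLink?.GetAttributeValue("href", "") ?? ""));
        }
        return result;
    }
}
```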
