Extracting table using Htmlagilitypack + LINQ + Lambda - c#

I'm having some difficulties using a lambda expression to parse an html table.
var cells = htmlDoc.DocumentNode
.SelectNodes("//table[@class='data stats']/tbody/tr")
.Select(node => new { playerRank = node.InnerText.Trim()})
.ToList();
foreach (var cell in cells)
{
Console.WriteLine("Rank: " + cell.playerRank);
Console.WriteLine();
}
I'd like to continue to use the syntax as
.Select(node => new { playerRank = node.InnerText.Trim()
but for the other columns of the table, such as player name, team, and position. I'm using XPath, so I'm unsure whether it's correct.
I'm having an issue finding out how to extract the link + player name from:
Steven Stamkos
The Xpath for it is:
//*[@id="fullPage"]/div[3]/table/tbody/tr[1]/td[2]/a
Can anyone help out?
EDIT* added HTML page.
http://www.nhl.com/ice/playerstats.htm?navid=nav-sts-indiv#

This should get you started:
var result = (from row in doc.DocumentNode.SelectNodes("//table[@class='data stats']/tbody/tr")
select new
{
PlayerName = row.ChildNodes[1].InnerText.Trim(),
Team = row.ChildNodes[2].InnerText.Trim(),
Position = row.ChildNodes[3].InnerText.Trim()
}).ToList();
The ChildNodes property contains the child nodes of each row: the cells, plus any whitespace text nodes between them if the markup has them. The index determines which cell you get.
To get the url from the anchor tag contained in the player name cell:
var result = (from row in doc.DocumentNode.SelectNodes("//table[@class='data stats']/tbody/tr")
select new
{
PlayerName = row.ChildNodes[1].InnerText.Trim(),
PlayerUrl = row.ChildNodes[1].ChildNodes[0].Attributes["href"].Value,
Team = row.ChildNodes[2].InnerText.Trim(),
Position = row.ChildNodes[3].InnerText.Trim()
}).ToList();
The Attributes collection is a list of the attributes in an HTML element. We are simply grabbing the value of href.

Related

C# strange issue - unable to assign value from right to left variable

I have a list, rows, which holds 10 different records. I am looping over this list in a C# console app and inserting values into another list, but it only picks the first record and inserts it 10 times into the new list.
When I debug, unique values are shown in the loop, but they are not being assigned to the left-hand variable.
List<Job> jobList=new List<Job>();
foreach (var row in rows)
{
Job job = new Job();
job.Title = row.SelectSingleNode("//h2[@class='jobtitle']").ChildNodes[1].Attributes["title"].Value;
job.Summary = row.SelectSingleNode("//span[@class='summary']").InnerText;
jobList.Add(job);
}
Any idea, what is happening?
I also used garbage collector but still no improvement:
job = null;
GC.Collect();
GC.WaitForPendingFinalizers();
Here is the updated code after @Andrew's suggestion, but it didn't work. The right side holds updated values, but they are not being assigned to the left-side variables.
foreach (var row in rows)
{
try
{
var job = new Job();
var title = row.SelectSingleNode("//h2[@class='jobtitle']").ChildNodes[1].Attributes["title"].Value;
var company = row.SelectSingleNode("//span[@class='company']").InnerText.Replace("\n", "").Replace("\r", "");
var location = row.SelectSingleNode("//span[@class='location']").InnerText.Replace("\n", "").Replace("\r", "");
var summary = row.SelectSingleNode("//span[@class='summary']").InnerText.Replace("\n", "").Replace("\r", "");
job.Title = title;
job.Company = company;
job.Location = location;
job.Summary = summary;
jobList.Add(job);
job = null;
GC.Collect();
GC.WaitForPendingFinalizers();
counter++;
Status("Page# " + pageNumber.ToString() + " : Record# " + counter + " extracted");
}
catch (Exception)
{
AppendRecords(jobList);
jobList.Clear();
}
//save file
}
You don't tell us what the rows variable relates to, but I assume these are nodes in a single XmlDocument. The XPath expressions you are using to extract values from these nodes are incorrect, because they always navigate to the same node in the document irrespective of the current row node.
Here's a simple example that demonstrates the problem:
static void Main(string[] args)
{
XmlDocument x = new XmlDocument();
x.LoadXml(@"<rows> <row><bla><h2>bob1</h2></bla></row> <row><bla><h2>bob2</h2></bla></row> </rows>");
var rows = x.GetElementsByTagName("row");
foreach (XmlNode row in rows)
{
var h2 = row.SelectSingleNode("//h2").ChildNodes[0].Value;
Console.WriteLine(h2);
}
}
The output from this will be
bob1
bob1
Not what you were expecting? Have a play with the example in Dot Net Fiddle. Take another look at your XPath expression. Your current expression //h2 is saying "give me all h2 elements in the document irrespective of the current node". Whereas .//h2 would give you the h2 elements that are descendants of the current row node, which is probably what you need.
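As a rough sketch of the fix, assuming the same small XML shape as the example above, the loop can use the relative expression .//h2 instead. The class and method names here (RelativeXPathDemo, ReadRowHeadings) are made up for illustration and are not part of the original code:

```csharp
using System.Collections.Generic;
using System.Xml;

public static class RelativeXPathDemo
{
    // Returns the text of the first <h2> found under each <row> element.
    // Hypothetical helper, written only to illustrate the relative XPath.
    public static List<string> ReadRowHeadings(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var headings = new List<string>();
        foreach (XmlNode row in doc.GetElementsByTagName("row"))
        {
            // ".//h2" starts the search at the current row.
            // "//h2" would start at the document root and always hit the first row.
            XmlNode h2 = row.SelectSingleNode(".//h2");
            if (h2 != null)
            {
                headings.Add(h2.InnerText);
            }
        }
        return headings;
    }
}
```

Called with the XML from the example, this should return bob1 followed by bob2 rather than bob1 twice. The same leading dot should apply to the //h2[@class='jobtitle'] and //span[...] expressions in the question.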

Returning Value of Grouped Items when using Linq to XML

I am trying to use Linq to XML in Visual Studio - C# to pull all of the Elements in an XML file and group them by their value.
This is my XML code:
<?xml version="1.0" encoding="utf-8"?>
<topTerms>
<topTerm>Cat</topTerm>
<topTerm>Dog</topTerm>
<topTerm>Cat</topTerm>
<topTerm>Dog</topTerm>
<topTerm>Cat</topTerm>
<topTerm>Bird</topTerm>
<topTerm>Cat</topTerm>
</topTerms>
I am then using the following C# code to try and pull the data and group it by value of the topTerm element.
var top = 0;
var topName = "";
var topTermsUrl = Server.MapPath("XML/topTerms.xml");
XDocument topTermsFile = XDocument.Load(topTermsUrl);
var topTermDocuments = topTermsFile.Root
.Elements("topTerm")
.GroupBy(a => a.Value);
foreach (var topTerm in topTermDocuments)
{
topName = topTerm.Value;
top = topTerm.Count();
}
However, topTerm.Value is not working. It will count the number of occurrence for each value when I cycle through, but I cannot get the string value being counted. Any ideas?
topTerm is an IGrouping, specifically IGrouping<string, XElement>. So it has a property Key that contains the value by which you grouped:
foreach (var topTerm in topTermDocuments)
{
topName = topTerm.Key; // Key used here
top = topTerm.Count();
}
(I assume that in the full code you do more than overwriting variables in a loop).
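If the goal is to keep every term with its count rather than overwrite two variables, one possible sketch is to project the groups into a dictionary. The names TopTermCounter, CountTerms and MostFrequent are assumptions for illustration, and the XML is passed in as a string instead of being loaded via Server.MapPath:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TopTermCounter
{
    // Counts how often each <topTerm> value occurs, keyed by the term text.
    public static Dictionary<string, int> CountTerms(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Root
            .Elements("topTerm")
            .GroupBy(e => e.Value)                      // IGrouping<string, XElement>
            .ToDictionary(g => g.Key, g => g.Count());  // Key is the grouped value
    }

    // Returns the most frequent term together with its count.
    public static KeyValuePair<string, int> MostFrequent(string xml)
    {
        return CountTerms(xml)
            .OrderByDescending(pair => pair.Value)
            .First();
    }
}
```

With the sample file from the question, CountTerms should give Cat = 4, Dog = 2 and Bird = 1, and MostFrequent should return the Cat entry.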

Using LINQ to aggregate multiple nested elements in XDocument

I have the following XML (parsed into an XDocument):
XDocument nvdXML = XDocument.Parse(@"<entry id='CVE-2016-1926'>
<vulnerable-configuration id='http://www.nist.gov/'>
<logical-test operator='OR' negate='false'>
<fact-ref name='A'/>
<fact-ref name='B'/>
<fact-ref name='C'/>
</logical-test>
</vulnerable-configuration>
<vulnerable-configuration id='http://www.nist.gov/'>
<logical-test operator='OR' negate='false'>
<fact-ref name='X'/>
<fact-ref name='Y'/>
<fact-ref name='Z'/>
</logical-test>
</vulnerable-configuration></entry>");
I want to get a single collection/list of every "name" attribute for each entry (in this case there is only one entry, whose name list would consist of ['A','B','C','X','Y','Z'])
Here is the code I have:
var entries = from entryNodes in nvdXML.Descendants("entry")
select new CVE
{
//VulnerableConfigurations = (from vulnCfgs in entryNodes.Descendants(vulnNS + "vulnerable-configuration").Descendants(cpeNS + "logical-test")
// select new VulnerableConfiguration
// {
// Name = vulnCfgs.Element(cpeNS + "fact-ref").Attribute("name").Value
// }).ToList()
VulnerableConfigurations = (from vulnCfgs in entryNodes.Descendants("vulnerable-configuration")
from logicalTest in vulnCfgs.Descendants("logical-test")
select new VulnerableConfiguration
{
Name = logicalTest.Element("fact-ref").Attribute("name").Value
}).ToList()
};
Unfortunately, this (both commented and uncommented) query only results in VulnerableConfigurations ['A','X'], instead of ['A','B','C','X','Y','Z']
How do I modify my query such that every element of every list is selected (assuming there could be 1+ nested lists)?
Note, I did search for dup's, and although there are similar questions, most are very specific, and ask for grouping/summing/manipulation, or are not related to XML parsing.
Final working code (thanks to accepted answer):
var entries = from entryNodes in nvdXML.Descendants("entry")
select new CVE
{
VulnerableConfigurations = (from vulnCfgs in entryNodes.Descendants("fact-ref")
select new VulnerableConfiguration
{
Name = vulnCfgs.Attribute("name").Value
}).ToList()
};
You can try this if you have only one entry:
var entries =(from fact in nvdXML.Descendants("fact-ref")
select new VulnerableConfiguration
{
Name = fact.Attribute("name").Value
}).ToList();
The Descendants method is going to return all descendant elements that match with that name in document order.
And if you have more than one entry and you want to return a list for each entry, you can do the following:
var entries =(from entry in nvdXML.Descendants("entry")
select entry.Descendants("fact-ref").Select(f=>f.Attribute("name").Value).ToList()
).ToList();
In this case you are going to get a list of lists (List<List<string>>)
Update
Your issue was that you are flattening your query over the logical-test elements, and your XML has two of them. In your select you are using the Element method, which gives you only one element. That's why you get A and X as the result: they are the first fact-ref elements inside your logical-test elements.
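To make the difference between Element and Elements concrete, here is a small sketch that keeps the question's nesting over logical-test but collects every fact-ref. The class and method names (FactRefNames, FirstPerTest, AllPerTest) are invented for this illustration:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class FactRefNames
{
    // Element(...) yields only the first matching child of each logical-test,
    // which reproduces the A, X result from the question.
    public static List<string> FirstPerTest(XElement entry)
    {
        return entry.Descendants("logical-test")
            .Select(test => test.Element("fact-ref"))
            .Where(fact => fact != null)
            .Select(fact => (string)fact.Attribute("name"))
            .ToList();
    }

    // Elements(...) yields every matching child; SelectMany flattens
    // the per-test sequences into one list.
    public static List<string> AllPerTest(XElement entry)
    {
        return entry.Descendants("logical-test")
            .SelectMany(test => test.Elements("fact-ref"))
            .Select(fact => (string)fact.Attribute("name"))
            .ToList();
    }
}
```

Passing nvdXML.Root from the question to AllPerTest should return A, B, C, X, Y, Z in document order, while FirstPerTest returns only A and X.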

Get all <td> title of a table column with coded ui

I need to check a filter function on a table.
This filter is only on the first cell of each row and I'm trying to figure out how to get all those values...
I tried with something like
public bool CheckSearchResults(HtmlControl GridTable, string FilterTxt)
{
List<string> Elements = new List<string>();
foreach (HtmlCell cell in GridTable.GetChildren())
{
Elements.Add(cell.FilterProperties["title"]);
}
List<string> Results = Elements.FindAll(l => l.Contains(FilterTxt));
return Results.Count == Elements.Count;
}
but I get stuck at the foreach loop...
Maybe there's a simple way with LINQ, but I don't know it very well.
edit:
all the cells i need have the same custom html tag.
with this code i should get them all, but i don't know how to iterate
HtmlDocument Document = this.UIPageWindow.UIPageDocument;
HtmlControl GridTable = this.UIPageWindow.UIPageDocument.UIPageGridTable;
HtmlCell Cells = new HtmlCell(GridTable);
Cells.FilterProperties["custom_control"] = "firstCellOfRow";
also because there's no GetEnumerator function or query models for HtmlCell objects, which are part of Microsoft.VisualStudio.TestTools.UITesting.HtmlControl library -.-
edit2:
i found this article and i tried this
public bool CheckSearchResults(string FilterTxt)
{
HtmlDocument Document = this.UIPageWindow.UIPageDocument;
HtmlControl GridTable = this.UIPageWindow.UIPageDocument.UIPageGridTable;
HtmlRow rows = new HtmlRow(GridTable);
rows.SearchProperties[HtmlRow.PropertyNames.Class] = "ui-widget-content jqgrow ui-row-ltr";
HtmlControl cells = new HtmlControl(rows);
cells.SearchProperties["custom_control"] = "firstCellOfRow";
UITestControlCollection collection = cells.FindMatchingControls();
List<string> Elements = new List<string>();
foreach (UITestControl elem in collection)
{
HtmlCell cell = (HtmlCell)elem;
Elements.Add(cell.GetProperty("Title").ToString());
}
List<string> Results = Elements.FindAll(l => l.Contains(FilterTxt));
return Results.Count == Elements.Count;
}
but i get an empty collection...
Try Cell.Title or Cell.GetProperty("Title"). SearchProperties and FilterProperties are only there for searching for a UI element. They come either from the UIMap or from code if you fill them out by hand. Otherwise your code should work.
Or you can use a LINQ query (?) like:
var FilteredElements =
from Cell in UIMap...GridTable.GetChildren()
where Cell.GetProperty("Title").ToString().Contains(FilterTxt)
select Cell;
You could also try to record a cell, add it to the UIMap, set its search or filter properties to match your filtering, then call UIMap...Cell.FindMatchingControls() and it should return all matching cells.
The problem now is that you are limiting your search to one row of the table. In HtmlControl cells = new HtmlControl(rows); the constructor parameter sets a search limit container and not the direct parent of the control. It should be the GridTable if you want to search all cells in the table. The best solution would be to use the recorder to get a cell control, then modify its search and filter properties in the UIMap to match all the cells you are looking for. Though in my opinion you should stick with hand-coded filtering. Something like:
foreach(var row in GridTable.GetChildren())
{
foreach(var cell in row.GetChildren())
{
//filter cell here
}
}
Check with AccExplorer or the recorder if the hierarchy is right. You should also use debug to be sure if the loops are getting the right controls and see the properties of the cells so you will know if the filter function is right.
I resolved it by scraping the page's HTML myself:
static public List<string> GetTdTitles(string htmlCode, string TdSearchPattern)
{
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlCode);
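// The attribute test needs "@"; TdSearchPattern is e.g. custom_control='firstCellOfRow'.
HtmlNodeCollection collection = doc.DocumentNode.SelectNodes("//td[@" + TdSearchPattern + "]");
List<string> Results = new List<string>();
// SelectNodes returns null (not an empty collection) when nothing matches.
if (collection == null) return Results;
foreach (HtmlNode node in collection)
{
Results.Add(node.InnerText);
}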
return Results;
}
I'm freakin' hating those stupid coded ui test -.-
btw, thanks for the help

Looping through XML element to add data

I don't know exactly how to word my question, so apologies up front. I have an XML file and it has elements like the following:
<Allow_BenGrade>
<Amount BenListID="0">0</Amount>
</Allow_BenGrade>
<Add_Earnings_NonTaxable>
<Amount AddEarnID="0">0</Amount>
</Add_Earnings_NonTaxable>
I am interested in Allow_BenGrade, where I want to add multiple elements. I have a list of 3 items, but when I loop through it to write to the file, only the last item in the list is written. So instead of having 3 elements inside Allow_BenGrade, I end up with one (the last one in the list). My code is below. Please help, thank you.
var query = from nm in xelement.Elements("EmployeeFinance")
select new Allowance {
a_empersonalID = (int)nm.Element("EmpPersonal_Id"),
a_allbengradeID = (int)nm.Element("Grade_Id")
};
var x = query.ToList();
foreach (var xEle in x)
{
var qryBenListGrade = from ee in context.Employee_Employ
join abg in context.All_Inc_Ben_Grade
on ee.Grade_Id equals abg.GradeID
join abl in context.All_Inc_Ben_Listing
on abg.All_Inc_Ben_ListingID equals abl.ID
where ee.Employee_Personal_InfoEmp_id == xEle.a_empersonalID && abg.GradeID == xEle.a_allbengradeID && (abl.Part_of_basic == "N" && abl.Status == "A" && abl.Type_of_earnings == 2)
//abl.Approved_on !=null &&
select new
{
abl.ID,
abl.Amount,
abg.GradeID,
ee.Employee_Personal_InfoEmp_id,
abl.Per_Non_Taxable,
abl.Per_Taxable
};
var y = qryBenListGrade.ToList();
//xEle.a_Amount = 0;
foreach (var tt in y)
{
Debug.WriteLine("amount: " + tt.Amount + " emp id: " + tt.Employee_Personal_InfoEmp_id + " ben list id: " + tt.ID);
// xEle.a_Amount = xEle.a_Amount + tt.Amount;
var result = from element in doc.Descendants("EmployeeFinance")
where int.Parse(element.Element("EmpPersonal_Id").Value) == tt.Employee_Personal_InfoEmp_id
select element;
foreach (var ele in result)
{
ele.Element("Allow_BenGrade").SetElementValue("Amount", tt.Amount);
//ele.Element("Allow_BenGrade").Element("Amount").SetAttributeValue("BenListID", tt.ID);
}
}
doc.Save(GlobalClass.GlobalUrl);
}
SetElementValue will, as the name suggests, set the value of the Amount element... You need to Add a new one instead:
ele.Element("Allow_BenGrade").Add(new XElement("Amount",
new XAttribute("BenListID", tt.ID),
tt.Amount));
Let me know if that solves it for you.
The XElement.SetElementValue Method:
Sets the value of a child element, adds a child element, or removes a
child element.
Also:
The value is assigned to the first child element with the specified
name. If no child element with the specified name exists, a new child
element is added. If the value is null, the first child element with
the specified name, if any, is deleted.
This method does not add child nodes or attributes to the specified
child element.
You should use the XElement.Add Method instead.
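The documented behavior quoted above can be checked with a small side-by-side sketch. The names AllowanceWriter, OverwriteAmount and AddAmount are hypothetical, and the parameter types (int id, decimal amount) are an assumption about the question's data:

```csharp
using System.Xml.Linq;

public static class AllowanceWriter
{
    // SetElementValue targets the first <Amount> child, so repeated calls
    // keep overwriting the same element.
    public static void OverwriteAmount(XElement allowBenGrade, decimal amount)
    {
        allowBenGrade.SetElementValue("Amount", amount);
    }

    // Add appends a new <Amount BenListID="..."> child on every call.
    public static void AddAmount(XElement allowBenGrade, int benListId, decimal amount)
    {
        allowBenGrade.Add(new XElement("Amount",
            new XAttribute("BenListID", benListId),
            amount));
    }
}
```

Calling OverwriteAmount three times on an empty Allow_BenGrade element should leave a single Amount child holding the last value, while three AddAmount calls should leave three. Note that if the file already contains the placeholder Amount with BenListID="0", Add keeps it alongside the new children, so it may need to be removed first.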
