XPath, select multiple elements from multiple nodes in HTML - c#

I just can't figure this one out.
I have to search through all nodes that have classes with "item extend featured" values in it (code below). In those classes I need to select every InnerText of <h2 class="itemtitle"> and href value in it, plus all InnerTexts from <div class="title-additional">.
<li class="item extend featured">
<div class="title-box">
<h2 class="itemtitle">
<a target="_top" href="www.example.com/example1/example2/exammple4/example4" title="PC Number 1">PC Number 1</a>
</h2>
<div class="title-additional">
<div class="title-km">150 km</div>
<div class="title-year">2009</div>
<div class="title-price">250 €</div>
</div>
The output should be something like this:
Title:
href:
Title-km:
Title-year:
Title-Price:
--------------
Title:
href:
Title-km:
Title-year:
Title-Price:
--------------
So, the question is, how to traverse through all "item extend featured" nodes in html and select items I need above from each node?
As I understand, something like this should work but it breaks halfway
EDIT: I just noticed, there are ads on the site that share the exact same class and they obviously don't have the elements I need. More problems to think about.
var items1 = htmlDoc.DocumentNode.SelectNodes("//*[@class='item extend featured']");
foreach (var e in items1)
{
    var test = e.SelectSingleNode(".//a[@target='_top']").InnerText;
    Console.WriteLine(test);
}

var page = new HtmlDocument();
page.Load(path);
var lists = page.DocumentNode.SelectNodes("//li[@class='item extend featured']");
foreach (var list in lists)
{
    var link = list.SelectSingleNode(".//*[@class='itemtitle']/a");
    // Ads share the same class but have no title link, so skip them.
    if (link == null) continue;
    string title = link.GetAttributeValue("title", string.Empty);
    string href = link.GetAttributeValue("href", string.Empty);
    string km = list.SelectSingleNode(".//*[@class='title-km']").InnerText;
    string year = list.SelectSingleNode(".//*[@class='title-year']").InnerText;
    string price = list.SelectSingleNode(".//*[@class='title-price']").InnerText;
    // C# format strings use {0}-style placeholders, not printf-style %s.
    Console.WriteLine("Title: {0}\r\n href: {1}\r\n Title-km: {2}\r\n Title-year: {3}\r\n Title-Price: {4}\r\n\r\n", title, href, km, year, price);
}

What you are trying to achieve requires multiple XPath expressions as you can't return multiple results at different levels using one query (unless you use Union perhaps).
What you might be looking for is something similar to this:
var listItems = htmlDoc.DocumentNode.SelectNodes("//li[@class='item extend featured']");
foreach (var li in listItems) {
    // The leading "." keeps each query relative to the current <li>;
    // a bare "//" would search the whole document again.
    var link = li.SelectSingleNode(".//h2/a");
    var title = link?.InnerText;
    var href = link?.GetAttributeValue("href", string.Empty);
    var titleKm = li.SelectSingleNode(".//div[@class='title-additional']/div[@class='title-km']")?.InnerText;
    // ... other divs
}
Note: code not tested
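Regarding the ads mentioned in the EDIT: one option is to let the XPath itself skip them by adding a second predicate that requires the title link to exist. The following is only a sketch under assumptions: the sample markup (including the ad item and the shortened href), the class name FeaturedItems and the method name Extract are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FeaturedItems
{
    // Hypothetical sample: one real item and one ad sharing the same class.
    public const string SampleHtml = @"
<ul>
  <li class='item extend featured'>
    <div class='title-box'>
      <h2 class='itemtitle'><a target='_top' href='www.example.com/pc1' title='PC Number 1'>PC Number 1</a></h2>
      <div class='title-additional'>
        <div class='title-km'>150 km</div>
        <div class='title-year'>2009</div>
        <div class='title-price'>250 €</div>
      </div>
    </div>
  </li>
  <li class='item extend featured'><div class='ad'>Advertisement</div></li>
</ul>";

    public static List<(string Title, string Href, string Km, string Year, string Price)> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<(string Title, string Href, string Km, string Year, string Price)>();

        // The second predicate keeps only <li> nodes that contain a title link,
        // so ads with the same class never reach the loop.
        var items = doc.DocumentNode.SelectNodes(
            "//li[@class='item extend featured'][.//h2[@class='itemtitle']/a]");
        if (items == null) return rows; // SelectNodes returns null when nothing matches

        // Returns trimmed inner text, or an empty string if the node is missing.
        string Text(HtmlNode node, string xpath) =>
            node.SelectSingleNode(xpath)?.InnerText.Trim() ?? string.Empty;

        foreach (var li in items)
        {
            var link = li.SelectSingleNode(".//h2[@class='itemtitle']/a");
            rows.Add((
                link.InnerText.Trim(),
                link.GetAttributeValue("href", string.Empty),
                Text(li, ".//div[@class='title-km']"),
                Text(li, ".//div[@class='title-year']"),
                Text(li, ".//div[@class='title-price']")));
        }
        return rows;
    }

    public static void Main()
    {
        foreach (var r in Extract(SampleHtml))
        {
            Console.WriteLine("Title: {0}\nhref: {1}\nTitle-km: {2}\nTitle-year: {3}\nTitle-Price: {4}\n--------------",
                r.Title, r.Href, r.Km, r.Year, r.Price);
        }
    }
}
```

Filtering in the XPath keeps the loop body free of ad handling; a null check inside the loop works just as well if the predicate becomes hard to read.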

Related

2sxc Get list of fields of an entity

Let's say I have an entity "Cars" with two fields "Brand" and "Model".
In a C# template, is it possible to dynamically get the names of the fields inside "Cars" as a list? The output would need to be {"Brand", "Model"}.
Even further, is it possible to get the description and a specific translation of the field name and description?
Using Daniel's comments and circling around to just answer your original question, here it is simplified and things are split up a little to see the parts:
@inherits ToSic.Sxc.Dnn.RazorComponent
@using Newtonsoft.Json
@{
var myData = AsList(Data);
var myDatum = AsEntity(myData.First());
var myFieldNames = (myDatum.Type.Attributes as IEnumerable<dynamic>).Select(a => a.Name);
}
<pre>
myDatum.Type.Name = @myDatum.Type.Name
myFieldNames = @JsonConvert.SerializeObject(myFieldNames)
</pre>
Which then outputs just:
myDatum.Type.Name = Cars
myFieldNames = ["Name","Brand","Model"]
I think what you are trying to do is covered in the tutorials pretty well.
https://2sxc.org/dnn-tutorials/en/razor
In particular take a look at the LINQ examples; numbers 6, 7, and 8.
Hi @Joao
Basically you should check Jeremy's answer for most of your question.
I believe you're also asking about showing the labels like Brand in the Razor using the field-label from the ContentType specs. This is possible, but it's a bit harder as it's not a common use case. So let me just point you in the right direction...
Each entity has a property called Type. In Razor you would get this using
var someType = AsEntity(yourThing).Type;
This is an IContentType https://docs.2sxc.org/api/dot-net/ToSic.Eav.Data.IContentType.html.
To get the properties and the names of them you would go to
var attr = someType.Attributes["TheName"];
which gives you an IContentTypeAttribute https://docs.2sxc.org/api/dot-net/ToSic.Eav.Data.IContentTypeAttribute.html
This has Metadata - so
var attr = someType.Attributes["TheName"].Metadata;
The metadata is an IMetadataOf https://docs.2sxc.org/api/dot-net/ToSic.Eav.Metadata.IMetadataOf.html
So using this you can find everything you want - but as you can see it's quite a hoop to jump through.
Here is a simple working example. I am sorta hoping Daniel chimes in and reveals an easy way to go from myType.Attributes and convert straight to a Json string??
Create a new View, Enable List, point it to your Cars Content-Type, fix the last few lines so that the "And the data..." part matches your CT's actual fields.
@inherits ToSic.Sxc.Dnn.RazorComponent
@{
  var myData = AsList(Data);
  var myType = AsEntity(myData.First()).Type;
  var myFields = new List<string>();
  foreach(var field in myType.Attributes) {
    myFields.Add(field.Name);
  }
}
<div @Edit.TagToolbar(Content)>
  <h3>View Heading</h3>
  <h4>Table (Content Type) Name: @myType.Name</h4>
  <p>has the following fields</p>
  <div class="d-flex flex-row bd-highlight mb-3">
    @foreach(var field in myType.Attributes) {
      <p class="p-2 bd-highlight"><strong>@field.Name</strong></p>
    }
  </div>
  <h4>As a comma separated list?</h4>
  <p>@string.Format("{{\"{0}\"}}", string.Join("\",\"", myFields))</p>
  <h4>And the data...</h4>
  @foreach(var cont in AsList(Data)) {
    <div class="d-flex flex-row bd-highlight mb-3"
      @Edit.TagToolbar(cont)>
      <div class="p-2 bd-highlight">@cont.EntityId</div>
      <div class="p-2 bd-highlight">@cont.Name</div>
      <div class="p-2 bd-highlight">@cont.Brand</div>
      <div class="p-2 bd-highlight">@cont.Model</div>
    </div>
  }
</div>
The Cars Content type with only 3 fields
And the output of the View looks like this
For an update on what I needed to achieve, this outputs the field name and the label for an easy loop:
var myData = AsList(App.Data["Stages"]);
var myDatum = AsEntity(myData.First());
var myFields = (myDatum.Type.Attributes as IEnumerable<dynamic>);
// var myFieldNames = myFields.Select(a => a.Name);
// var myFieldLabels = myFields.Select(a => (a.Metadata as IEnumerable<dynamic>).First().Title.TypedContents);
var myFieldNamesAndLabels = myFields.Select(i => new
{
    i.Name,
    // An anonymous-type member built from a method call needs an explicit name.
    Label = (i.Metadata as IEnumerable<dynamic>).First().GetBestTitle()
});
If there is an easier way to achieve this, please let me know.
Thanks @Jeremy Farrance and @iJungleBoy

Verify Order Of HTML Elements With Attribute Values Such as Class="Group0-Item1" Class="Group0-Item2" Class="Group1-Item1"

In my Selenium/C#/NUNIT project, I need to find a way to validate the order (top down hierarchy of a page's HTML) for a group of HTML elements (as well as the elements contained within those groups). These are my elements that show inside my page's HTML...
<div class="gapBanner-banner-order1-group0"></div>
<div class="gapBanner-banner-order1-group1"></div>
<div class="gapBanner-banner-order1-group2"></div>
<div class="gapBanner-banner-order2-group2"></div>
The validation I want to perform should be able to catch the following bugs:
Bug 1: The groups are not in order within the page's HTML. One of the elements that is in group1 appears first in the HTML before group0...
<div class="gapBanner-banner-order1-group1"></div>
<div class="gapBanner-banner-order1-group0"></div>
<div class="gapBanner-banner-order1-group2"></div>
<div class="gapBanner-banner-order2-group2"></div>
Bug #2: The elements WITHIN each group are not in order within the page's HTML. Group2-Order2 appears before Group2-Order1 within the HTML
<div class="gapBanner-banner-order1-group0"></div>
<div class="gapBanner-banner-order1-group1"></div>
<div class="gapBanner-banner-order2-group2"></div>
<div class="gapBanner-banner-order1-group2"></div>
Below is what I have coded so far, but it is definitely not going to do the job, not to mention that it is very messy. I can't figure out what kind of logic I need for this.
/// 5. Verify the correct order of elements in which they appear inside the HTML
List<IWebElement> CustomPageHTMLComponents = Browser.
FindElements(By.XPath("//div[contains(@class, 'group')]")).ToList();
List<IWebElement> uniqueGroups = new List<IWebElement>();
// Get the unique groups
for (int i = 0; i < CustomPageHTMLComponents.Count; i++)
{
IWebElement currentComponent = Browser.FindElements(By.XPath("//div[contains(@class, 'group')]"))[i];
string toBeSearched = "group";
string currentComponenetClassAttributeValue = currentComponent.GetAttribute("class");
int x = currentComponenetClassAttributeValue.IndexOf(toBeSearched);
string groupNumber = currentComponenetClassAttributeValue.Substring(x + toBeSearched.Length);
if (groupNumber == i.ToString())
{
uniqueGroups.Add(currentComponent);
}
}
// Some kind of logic to verify everything???
for (int i = 0; i < Page.CustomPageHTMLComponents.Count; i++)
{
IWebElement currentComponent = Browser.FindElements(By.XPath("//div[contains(@class, 'group')]"))[i];
string toBeSearched = "group";
string currentComponenetClassAttributeValue = currentComponent.GetAttribute("class");
int x = currentComponenetClassAttributeValue.IndexOf(toBeSearched);
string groupNumber = currentComponenetClassAttributeValue.Substring(x + toBeSearched.Length);
Assert.AreEqual(groupNumber, i.ToString());
}
There are probably a number of ways to do this. This is the first way I came up with...
1. Grab all the class names from the desired elements and store them in string array #1
2. Make a copy of string array #1 and sort it
3. Compare the two arrays and if they are equal, then they were sorted to start with
I've checked the HTML you provided for the bugs you'd like to catch and it catches them all. The one issue I can think of is if you get more than 9 orders or groups the sorting will not be what you want because it's alpha order not numerical order, e.g. 1, 10, 2, ... instead of 1, 2, ... 10.
// capture the class names from the desired classes
string[] elements = _driver.FindElements(By.CssSelector("div[class^='gapBanner-banner-']")).Select(e => e.GetAttribute("class")).ToArray();
// make a copy of the array
string[] sortedElements = new string[elements.Length];
elements.CopyTo(sortedElements, 0);
// sort the copy
Array.Sort(sortedElements);
// compare the arrays for order using NUnit CollectionAssert
CollectionAssert.AreEqual(elements, sortedElements, "Verify ordering of elements");
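If more than 9 orders or groups are expected, the alphabetical limitation noted above can be avoided by comparing the numbers rather than the strings. The sketch below assumes every class name ends in "order<N>-group<M>" as in the question, and assumes the intended order is by group first and then by order within a group. BannerOrder, Parse and IsInExpectedOrder are hypothetical names, and only the standard library is used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BannerOrder
{
    // Assumed class-name shape, taken from the question: "...-order<N>-group<M>".
    private static readonly Regex Pattern = new Regex(@"order(\d+)-group(\d+)$");

    // Parses a class name into a numeric (group, order) pair.
    public static (int Group, int Order) Parse(string className)
    {
        var m = Pattern.Match(className);
        if (!m.Success)
            throw new FormatException("Unexpected class name: " + className);
        return (int.Parse(m.Groups[2].Value), int.Parse(m.Groups[1].Value));
    }

    // True when the names are already sorted by group, then by order, numerically.
    public static bool IsInExpectedOrder(IEnumerable<string> classNames)
    {
        var keys = classNames.Select(Parse).ToList();
        var sorted = keys.OrderBy(k => k.Group).ThenBy(k => k.Order).ToList();
        return keys.SequenceEqual(sorted);
    }

    public static void Main()
    {
        var correct = new[]
        {
            "gapBanner-banner-order1-group0",
            "gapBanner-banner-order1-group1",
            "gapBanner-banner-order1-group2",
            "gapBanner-banner-order2-group2",
        };
        var bug2 = new[]
        {
            "gapBanner-banner-order1-group0",
            "gapBanner-banner-order1-group1",
            "gapBanner-banner-order2-group2",
            "gapBanner-banner-order1-group2",
        };
        Console.WriteLine(IsInExpectedOrder(correct)); // True
        Console.WriteLine(IsInExpectedOrder(bug2));    // False
    }
}
```

In a test, the class names collected with FindElements could be passed to IsInExpectedOrder and the result checked with Assert.IsTrue.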

Is there a way of scraping a list of HTML elements based on key words where the document structure is indeterminate?

I am building a scraper to be used on many sites (too many to scrape manually using a web scraping tool such as Octoparse).
Each site will probably be different in structure. Some sites may have data that I wish to scrape; some may not. This is to be determined using a list of keywords/keyphrases. On sites whose data I wish to parse, the data is likely to be presented as a list of some kind. However, the HTML elements used to present the list are indeterminate (e.g. it could be a ul/li list, a list of divs, a table, etc.).
If a keyword/keyphrase is found, I wish for not only that element to be parsed, but all others that may be part of the same list/group.
Example 1
<div>
<h1>Random content I am not interested in</h1>
</div>
<div>
<h1>Some more random content I am not interested in</h1>
</div>
<div>
<ul>
<li>Dogs</li>
<li>Cats</li>
<li>Birds</li>
</ul>
</div>
Example 2
<div>
<h1>Random content I am not interested in</h1>
</div>
<div>
<h1>Some more random content I am not interested in</h1>
</div>
<div>
<div>
<div>
<div>
<h1>Bob</h1>
<p>A description of Bob</p>
</div>
<div>
<h1>Ben</h1>
<p>A description of Ben</p>
</div>
<div>
<h1>Bill</h1>
<p>A description of Bill</p>
</div>
</div>
</div>
</div>
From example one, if I had identified the element Dogs, I would like the result to be Dogs, Cats, Birds.
From example two, if I had identified Ben, I would like the result to be 3 div elements, each of which contains the heading and paragraph; the key is that all results are to include HTML, not just text.
Any help/guidance would be much appreciated.
I managed something like this:
static IEnumerable<string> FindSimilarItems(string html, string[] values, int maxDepth)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    var output = new List<string>();
    foreach (var value in values)
    {
        // Find the element whose text matches the keyword exactly.
        var rootElement = doc.DocumentNode.SelectSingleNode($"//*[text()='{value}']");
        if (rootElement == null) continue;
        for (int i = 0; i < maxDepth; i++)
        {
            // Drop the positional index at one level of the path to match siblings there.
            var newXpath = RemoveXpathGroupIndex(rootElement.XPath, i);
            var newElements = doc.DocumentNode.SelectNodes(newXpath);
            // SelectNodes returns null when nothing matches.
            if (newElements == null || newElements.Count <= 1) continue;
            output.AddRange(newElements.Select(x => x.InnerText));
        }
    }
    return output.GroupBy(x => x).Select(x => x.First()).ToList();
}
static string RemoveXpathGroupIndex(string xpath, int groupElement)
{
    var splited = xpath.Split('/');
    var pickedElement = splited.Length - 1 - groupElement;
    if (pickedElement < 0) return xpath; // deeper than the path itself
    var bracket = splited[pickedElement].IndexOf('[');
    // Only strip when the step actually carries an index such as "[1]".
    if (bracket >= 0) splited[pickedElement] = splited[pickedElement].Substring(0, bracket);
    return string.Join("/", splited);
}
This code:
var similarItems = FindSimilarItems(input1, new string[] { "Dogs" }, 3);
Will return
["Dogs", "Cats", "Birds"]
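To make the idea behind RemoveXpathGroupIndex easier to follow, here is a small standalone sketch of the path rewriting on its own. The input path "/div[3]/ul[1]/li[1]" is an assumed value for the XPath of the Dogs element in Example 1 (the real value comes from HtmlAgilityPack and may differ), and XPathDemo and RemoveIndex are hypothetical names. Only the standard library is used.

```csharp
using System;

public static class XPathDemo
{
    // Removes the positional index ("[n]") from the step that sits
    // stepsFromEnd levels above the last step of the path.
    public static string RemoveIndex(string xpath, int stepsFromEnd)
    {
        var parts = xpath.Split('/');
        var picked = parts.Length - 1 - stepsFromEnd;
        if (picked < 0) return xpath;
        var bracket = parts[picked].IndexOf('[');
        if (bracket >= 0) parts[picked] = parts[picked].Substring(0, bracket);
        return string.Join("/", parts);
    }

    public static void Main()
    {
        const string path = "/div[3]/ul[1]/li[1]"; // assumed path of the matched element
        for (int i = 0; i < 3; i++)
        {
            Console.WriteLine(RemoveIndex(path, i));
        }
        // Prints:
        // /div[3]/ul[1]/li
        // /div[3]/ul/li[1]
        // /div/ul[1]/li[1]
    }
}
```

Under that assumption, only the first rewritten path matches more than one node in Example 1 (all three li elements), which is why the other two depths are skipped by the Count check.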

How to parse text from anonymous block in AngleSharp?

I'm parsing site content using AngleSharp and I've got an issue with an anonymous block.
See the sample code:
var parser = new HtmlParser();
var document = parser.Parse(@"<body>
<div class='product'>
<a href='#'><img src='img1.jpg' alt=''></a>
Hello, world
<div class='comments-likes'>1</div>
</div>
<div class='product'>
<a href='#'><img src='img2.jpg' alt=''></a>
Yet another helloworld
<div class='comments-likes'>25</div>
</div>
</body>");
var products = document.QuerySelectorAll("div.product");
foreach (var product in products)
{
var productTitle = product.Text();
productTitle.Dump();
}
So, productTitle contains numbers from div.comments-likes, output is:
Hello, world 1
Yet another helloworld 25
I've tried something like product.FirstElementChild.NextElementSibling.Text(); but next sibling for link element is div.comments-likes, not anonymous block. It shows:
1
25
So, anonymous blocks are skipped. :(
The best workaround I've found is deleting all the interfering blocks, for my example:
product.QuerySelector(".comments-likes").Remove();
var productTitle = product.Text().Trim();
Is there a better way to parse text from an anonymous block?
Text is modeled as a TextNode, which is a type of node alongside element, comment node, processing instruction, etc. That's why the NextElementSibling you tried didn't include the text in the result: it is intended to return elements only, as the name suggests.
You can get text nodes located directly within the product div by traversing the div's ChildNodes and then filtering by NodeType, for example:
var products = document.QuerySelectorAll("div.product");
foreach (var product in products)
{
var productTitle = product.ChildNodes
.First(o => o.NodeType == AngleSharp.Dom.NodeType.Text
&& o.TextContent.Trim() != "");
Console.WriteLine(productTitle.TextContent.Trim());
}
dotnetfiddle demo
Notice that newlines between elements are also text nodes, so we need to filter those out in the demo above.

How can I make sure that the number displays correctly on each grid that gets added?

I am using MVC + EF
I have a feed XML file URL that gets updated every 7 minutes with items. Every time a new item is added, I retrieve all the items into a list variable and then add them to my database table. After that I fill a new list variable, which is my ViewModel, from the database table. Then I declare the ViewModel inside my view, which is a .cshtml file, loop through all of the objects and display them.
How can I make sure that the newest items get placed at the top and not at the bottom, and that the numbers display in the correct order?
This is how I display the items inside my .cshtml. Note that I use ++number, so the newest item needs to be 1 and so on:
@model Project.Viewmodel.ItemViewModel
@{
    int number = 0;
}
<div id="news-container">
    @foreach (var item in Model.NewsList.OrderByDescending(n => n.PubDate))
    {
        <div class="grid">
            <div class="number">
                <p class="number-data">@(++number)</p>
            </div>
            <p class="news-title">@(item.Title)</p>
            <div class="item-content">
                <div class="imgholder">
                    <img src="@item.Imageurl" />
                    <p class="news-description">
                        @(item.Description)
                        <br />@(item.PubDate) |
                        Source
                    </p>
                </div>
            </div>
        </div>
    }
</div>
This is how I fill the viewmodel which I use inside the .cshtml file to iterate through and display the items:
private void FillProductToModel(ItemViewModel model, News news)
{
var productViewModel = new NewsViewModel
{
Description = news.Description,
NewsId = news.Id,
Title = news.Title,
link = news.Link,
Imageurl = news.Image,
PubDate = news.Date,
};
model.NewsList.Add(productViewModel);
}
If you check this image, that's how it gets displayed with the numbers, and that's incorrect.
If you see the arrows, that's how it should be. How can I accomplish that?
Any kind of help is appreciated :)
Note: when I remove .OrderByDescending, the numbers are correct on each grid. But I need .OrderByDescending because I want the latest added item at the top.
Try this:
@model Project.Viewmodel.ItemViewModel
@{
    int number = 0;
    var NewsItems = Model.NewsList.OrderByDescending(n => n.PubDate).ToList();
}
<div id="news-container">
    @foreach (var item in NewsItems)
    {
        <div class="grid">
            <div class="number">
                <p class="number-data">@(++number)</p>
            </div>
            <p class="news-title">@(item.Title)</p>
            <div class="item-content">
                <div class="imgholder">
                    <img src="@item.Imageurl" />
                    <p class="news-description">
                        @(item.Description)
                        <br />@(item.PubDate) |
                        Source
                    </p>
                </div>
            </div>
        </div>
    }
</div>
Looking at your sketch I assume you have float: left or display: inline-block for a grid class. Adding float: right might do the trick.
If that does not help please post CSS you have.
Just a quick word:
you are passing NewsViewModel to the view but iterating over ItemViewModel. Why?
Do you think this may be the cause of the problem?
Regards
You could sort your news list using the CompareTo method:
model.NewsList.Sort((a, b) => b.PubDate.Date.CompareTo(a.PubDate.Date));
Once you have the list sorted correctly, you can simply use CSS to display the news list two items per row. See this fiddle.
The fiddle is a revised one which was provided to me in a similar question I asked before.
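As a rough illustration of keeping the display number tied to the sorted position, the sort and the numbering can be done in one LINQ pipeline instead of incrementing a counter in the view. This is only a sketch: NewsItem, NewsRanking and Rank are hypothetical names, and PubDate is assumed to be a DateTime. Note that it sorts on the full PubDate, whereas comparing PubDate.Date ignores the time of day.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the view model used in the question.
public class NewsItem
{
    public string Title { get; set; }
    public DateTime PubDate { get; set; }
}

public static class NewsRanking
{
    // Sorts newest first and pairs each item with its 1-based display number.
    public static List<(int Number, NewsItem Item)> Rank(IEnumerable<NewsItem> news) =>
        news.OrderByDescending(n => n.PubDate)
            .Select((item, index) => (Number: index + 1, Item: item))
            .ToList();

    public static void Main()
    {
        var news = new List<NewsItem>
        {
            new NewsItem { Title = "Oldest", PubDate = new DateTime(2020, 1, 1, 8, 0, 0) },
            new NewsItem { Title = "Newest", PubDate = new DateTime(2020, 1, 2, 9, 30, 0) },
            new NewsItem { Title = "Middle", PubDate = new DateTime(2020, 1, 2, 9, 0, 0) },
        };

        foreach (var (number, item) in Rank(news))
        {
            Console.WriteLine("{0}. {1}", number, item.Title);
        }
        // Prints:
        // 1. Newest
        // 2. Middle
        // 3. Oldest
    }
}
```

With the number carried alongside each item, the remaining question is purely one of CSS layout, as the other answers suggest.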
Try this one
// newsItems is assumed to be the full list of News entities loaded from the database.
private void FillProductsToModel(ItemViewModel model, IEnumerable<News> newsItems)
{
    // Sort newest first, then map each entity to its view model.
    foreach (var news in newsItems.OrderByDescending(x => x.Date))
    {
        var productViewModel = new NewsViewModel
        {
            Description = news.Description,
            NewsId = news.Id,
            Title = news.Title,
            link = news.Link,
            Imageurl = news.Image,
            PubDate = news.Date,
        };
        model.NewsList.Add(productViewModel);
    }
}
