Trying to identify text nodes with HtmlAgilityPack - C#

I am trying to identify the text nodes in HTML text formatted like the samples below:
sample text 1 : <strong>[Hot Water][Steam][Electric]</strong> Preheating Coil
sample text 2 : <b><span>[Steam] [Natural Gas Fired] [Electric] [Steam to steam]</span></b><span> Humidifier</span><br>
I am using the code below:
public static string IdentifyHTMLTagsAndRemove(string htmlText)
{
_ = htmlText ?? throw new ArgumentNullException(nameof(htmlText));
var document = new HtmlDocument();
document.LoadHtml(htmlText);
var rootNode = document.DocumentNode;
// get first and last text nodes
var nonEmptyTextNodes = rootNode.SelectNodes("//text()[not(self::text())]") ?? new HtmlNodeCollection(null);
//if (nonEmptyTextNodes.Count == 0)
//{
// return rootNode.OuterHtml;
//}
if (nonEmptyTextNodes.Count > 0)
{
var firstTextNode = nonEmptyTextNodes[0];
var lastTextNode = nonEmptyTextNodes[^1];
// get all br nodes in html string,
var breakNodes = rootNode.SelectNodes("//br") ?? new HtmlNodeCollection(null);
var lastTextNodeLengthIndex = lastTextNode.OuterStartIndex + lastTextNode.OuterLength;
foreach (var breakNode in breakNodes)
{
if (breakNode == null)
continue;
// check index of br nodes against first and last text nodes
// and remove br nodes that sit outside text nodes
if (breakNode.OuterStartIndex <= firstTextNode.OuterStartIndex
|| breakNode.OuterStartIndex >= lastTextNodeLengthIndex)
{
breakNode.Remove();
}
}
}
return rootNode.OuterHtml;
}
But it consistently fails at this line:
var nonEmptyTextNodes = rootNode.SelectNodes("//text()[not(self::text())]") ?? new HtmlNodeCollection(null);
nonEmptyTextNodes always has a count of zero, and I am unsure what I am doing wrong in the code above.
Could anyone please point me in the right direction? Many thanks in advance.

In addition to Siebe's answer, I'd also like to point out an inefficiency in the code that trims start/end BR tags. If you look at the HtmlAgilityPack code for HtmlNode operations, you'll see that whenever nodes are removed, the SetChanged() method is called on the parent (and its parent, all the way up). The next time you check the start/end indexes of anything in the tree, they need to be recalculated. So this code could be made to run much faster if you instead just create a temporary list of all the nodes to be removed, then remove them after they've all been identified.
var lastTextNodeLengthIndex = lastTextNode.OuterStartIndex + lastTextNode.OuterLength;
var breakNodesToRemove = rootNode.SelectNodes("//br")?.Where(node => node.OuterStartIndex <= firstTextNode.OuterStartIndex || node.OuterStartIndex >= lastTextNodeLengthIndex).ToList();
breakNodesToRemove?.ForEach(a => a.Remove());
reference: https://github.com/zzzprojects/html-agility-pack/blob/master/src/HtmlAgilityPack.Shared/HtmlNode.cs

I'm not sure what you are trying to achieve with
//text()[not(self::text())]
It selects text() nodes that are not text() nodes, so it can never match anything. If you just use
//text()
it will select all text() nodes.
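To make the difference concrete, here is a minimal sketch. It uses System.Xml's XmlDocument as a stand-in for HtmlAgilityPack so that it runs without extra packages; the XPath expressions themselves are the same. The wrapping <div> root and the class and method names are assumptions added for illustration. The third expression, //text()[normalize-space()], is one possible way to skip whitespace-only text nodes, which may be what the original predicate was aiming for.

```csharp
using System;
using System.Xml;

public static class TextNodeXPathDemo
{
    // Returns how many nodes the given XPath matches in the sample markup.
    public static int CountMatches(string xpath)
    {
        var doc = new XmlDocument();
        // Sample text 1 from the question, wrapped in a hypothetical <div>
        // root so that it is well-formed XML.
        doc.LoadXml("<div><strong>[Hot Water][Steam][Electric]</strong> Preheating Coil</div>");
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        // A text node can never fail the self::text() test, so nothing matches.
        Console.WriteLine(CountMatches("//text()[not(self::text())]")); // 0

        // Every text node in the document.
        Console.WriteLine(CountMatches("//text()")); // 2

        // Only text nodes that contain something other than whitespace.
        Console.WriteLine(CountMatches("//text()[normalize-space()]")); // 2 here
    }
}
```

Note that XmlDocument.SelectNodes returns an empty list when nothing matches, whereas HtmlAgilityPack's SelectNodes returns null, which is why the question's code needs the ?? fallback.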

Related

C# find a specific element with two or more xml files?

I'll try to explain my problem:
I need the KSCHL and the Info.
I need to take each KSCHL from the result file and then look it up in the other file, "Data".
The first file contains all the KSCHL values.
var kschlResultList = docResult.SelectNodes(...);
var kschlDataList = docData.SelectNodes(...);
var infoDataList = docData.SelectNodes(...);
for (int i = 0; i < kschlResultList.Count; i++)
{
string kschlResult = kschlResultList[i].InnerText;
for (int x = 0; x < kschlDataList.Count; x++)
{
string kschlData = kschlDataList[x].InnerText;
if (kschlData == kschlResult)
{
for (int y = 0; y < infoDataList.Count; y++)
{
string infoData = infoDataList[y].InnerText;
if (infoData == kschlResult)
{
//I know the If is false
string infoFromKschl = infoData;
}
}
}
}
}
The problem now is to find the KSCHL (from the first file) in the second file and then look up its "info".
So if I have the KSCHL "KVZ1" in the first file, I want to find this KSCHL in the second file and get the associated Info.
I hope that makes sense :)

You don't have to loop quite so much. :-)
With XPath (the special strings passed to SelectNodes() or SelectSingleNode()), you can go pretty directly to what you want.
You can see a great basic example - several really - of how to select an XML node based on another node at the same level here:
How to select a node using XPath if sibling node has a specific value?
In your case, we can get to a list of the INFO values more simply by looping just through the KSCHL values. I use them as text, because I want to make a new XPath string with them.
I'm not clear exactly what format you want the results in, so I'm simply pushing them into a SortedDictionary for now.
At that last step, you could do whatever is most useful to you, such as push them into a database, dump them in a file, or send them to another function.
/***************************************************************
*I'm not sure how you want to use the results still,
* so I'll just stick them in a Dictionary for this example.
* ***********************************************************/
SortedDictionary<string, string> objLookupResults = new SortedDictionary<string, string>();
// --- note how I added /text()... doesn't change much, but being specific <<<<<<
var kschlResultList = docResult.SelectNodes("//root/CalculationLogCompact/CalculationLogRowCompact/KSCHL/text()");
foreach (System.Xml.XmlText objNextTextNode in kschlResultList) {
// get the actual text from the XML text node
string strNextKSCHL = objNextTextNode.InnerText;
// use it to make the XPath to get the INFO --- see the [KSCHL/text()= ...
string strNextXPath = "//SNW5_Pricing_JKV-Q10_full/PricingProcedure[KSCHL/text()=\"" + strNextKSCHL + "\" and PRICE>0]/INFO/text()";
// and get that INFO text! I use SelectSingleNode here, assuming only one INFO for each KSCHL..... if there can be more than one INFO for each KSCHL, then we'd need another loop here
string strNextINFO = docData.SelectSingleNode(strNextXPath)?.InnerText; // <<< note I added the ? because now there may be no result with the rule PRICE>0.
// --- then you need to put this result somewhere useful to you.
// I'm not sure what that is, so I'll stick it in the Dictionary object.
if (strNextINFO != null) {
objLookupResults.Add(strNextKSCHL, strNextINFO);
}
}
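For anyone who wants to try the approach end to end, here is a self-contained sketch. The element names follow the XPath strings above, but the sample XML content (the KVZ2 row, the prices, and the INFO texts) is invented for illustration, as are the class and method names. It also assumes the KSCHL values contain no apostrophes, since they are spliced directly into the XPath string.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class KschlLookupDemo
{
    // For every KSCHL in the result document, look up the matching INFO
    // in the data document with a single XPath query per KSCHL.
    public static SortedDictionary<string, string> Lookup(XmlDocument docResult, XmlDocument docData)
    {
        var results = new SortedDictionary<string, string>();
        foreach (XmlNode kschlNode in docResult.SelectNodes("//CalculationLogRowCompact/KSCHL"))
        {
            string kschl = kschlNode.InnerText;
            // Assumption: kschl contains no apostrophe, so simple quoting is enough.
            string xpath = "//PricingProcedure[KSCHL='" + kschl + "' and PRICE>0]/INFO";
            XmlNode info = docData.SelectSingleNode(xpath);
            if (info != null)
            {
                results[kschl] = info.InnerText;
            }
        }
        return results;
    }

    public static SortedDictionary<string, string> RunSample()
    {
        // Hypothetical sample data; only the element names come from the answer.
        var docResult = new XmlDocument();
        docResult.LoadXml(
            "<root><CalculationLogCompact>" +
            "<CalculationLogRowCompact><KSCHL>KVZ1</KSCHL></CalculationLogRowCompact>" +
            "<CalculationLogRowCompact><KSCHL>KVZ2</KSCHL></CalculationLogRowCompact>" +
            "</CalculationLogCompact></root>");

        var docData = new XmlDocument();
        docData.LoadXml(
            "<SNW5_Pricing_JKV-Q10_full>" +
            "<PricingProcedure><KSCHL>KVZ1</KSCHL><PRICE>10</PRICE><INFO>Freight</INFO></PricingProcedure>" +
            "<PricingProcedure><KSCHL>KVZ2</KSCHL><PRICE>0</PRICE><INFO>Discount</INFO></PricingProcedure>" +
            "</SNW5_Pricing_JKV-Q10_full>");

        return Lookup(docResult, docData);
    }

    public static void Main()
    {
        foreach (var pair in RunSample())
        {
            Console.WriteLine(pair.Key + " -> " + pair.Value); // KVZ1 -> Freight
        }
    }
}
```

KVZ2 is skipped in this sample because its PRICE is 0 and therefore fails the PRICE>0 condition.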

C# strange issue - unable to assign value from right to left variable

I have a list, rows, which holds 10 different records. I am looping over this list in a C# console app and inserting values into another list, but it only picks the first record and inserts it 10 times into the new list.
When I debug, unique values are shown in the loop, but they are not being assigned to the variable on the left.
List<Job> jobList=new List<Job>();
foreach (var row in rows)
{
Job job = new Job();
job.Title = row.SelectSingleNode("//h2[@class='jobtitle']").ChildNodes[1].Attributes["title"].Value;
job.Summary = row.SelectSingleNode("//span[@class='summary']").InnerText;
jobList.Add(job);
}
Any idea, what is happening?
I also used garbage collector but still no improvement:
job = null;
GC.Collect();
GC.WaitForPendingFinalizers();
Here is the updated code after @Andrew's suggestion, but it didn't work. The right side holds updated values, but they are not being assigned to the left-side variables.
foreach (var row in rows)
{
try
{
var job = new Job();
var title = row.SelectSingleNode("//h2[@class='jobtitle']").ChildNodes[1].Attributes["title"].Value;
var company = row.SelectSingleNode("//span[@class='company']").InnerText.Replace("\n", "").Replace("\r", "");
var location = row.SelectSingleNode("//span[@class='location']").InnerText.Replace("\n", "").Replace("\r", "");
var summary = row.SelectSingleNode("//span[@class='summary']").InnerText.Replace("\n", "").Replace("\r", "");
job.Title = title;
job.Company = company;
job.Location = location;
job.Summary = summary;
jobList.Add(job);
job = null;
GC.Collect();
GC.WaitForPendingFinalizers();
counter++;
Status("Page# " + pageNumber.ToString() + " : Record# " + counter + " extracted");
}
catch (Exception)
{
AppendRecords(jobList);
jobList.Clear();
}
//save file
}

You don't tell us what the rows variable relates to, but I assume these are nodes in a single XmlDocument. The XPath expressions you are using to extract values from these nodes are incorrect, because they will always navigate to the same node in the document irrespective of the current row node.
Here's a simple example that demonstrates the problem:-
static void Main(string[] args)
{
XmlDocument x = new XmlDocument();
x.LoadXml(@"<rows> <row><bla><h2>bob1</h2></bla></row> <row><bla><h2>bob2</h2></bla></row> </rows>");
var rows = x.GetElementsByTagName("row");
foreach (XmlNode row in rows)
{
var h2 = row.SelectSingleNode("//h2").ChildNodes[0].Value;
Console.WriteLine(h2);
}
}
The output from this will be
bob1
bob1
Not what you were expecting? Have a play with the example in Dot Net Fiddle. Take another look at your XPath expression. Your current expression //h2 is saying "give me all h2 elements in the document irrespective of the current node". Whereas .//h2 would give you the h2 elements that are descendants of the current row node, which is probably what you need.
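As a quick check of that last point, the sketch below runs the same loop twice, once with the absolute expression and once with the relative one. The class and method names are made up for this illustration; the sample XML is the one from the example above.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class RelativeXPathDemo
{
    // Evaluates the given XPath against each <row> in turn and collects the text.
    public static List<string> ReadTitles(string xpathPerRow)
    {
        var doc = new XmlDocument();
        doc.LoadXml("<rows><row><bla><h2>bob1</h2></bla></row><row><bla><h2>bob2</h2></bla></row></rows>");

        var titles = new List<string>();
        foreach (XmlNode row in doc.GetElementsByTagName("row"))
        {
            titles.Add(row.SelectSingleNode(xpathPerRow).InnerText);
        }
        return titles;
    }

    public static void Main()
    {
        // "//h2" starts at the document root, so every row yields the first h2.
        Console.WriteLine(string.Join(", ", ReadTitles("//h2")));  // bob1, bob1

        // ".//h2" starts at the current row, so each row yields its own h2.
        Console.WriteLine(string.Join(", ", ReadTitles(".//h2"))); // bob1, bob2
    }
}
```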

How to write a Numbered List of Text in a specific location?

I need to write an array of strings as a numbered list at a specific location in a document.
For example, the array is:
sentence[0] : Jonathan Spielberg
sentence[1] : Stephanie Black
sentence[2] : Marcus Smith
sentence[3] : Kylie Ashton
...
Then it should be written in a specific location, let's say under the section heading "A. Candidate's Name"
A. Candidate's Name
1. Jonathan Spielberg
2. Stephanie Black
3. Marcus Smith
4. Kylie Ashton
My approach so far is to use a unique tag, which is then replaced by the array items written as a numbered list. Let's say the unique tag is ######CANDIDATESNAME#####. I've tried it this way, but it doesn't work.
How should I code this?
P.S.: I have a template document (.doc/.docx) that contains only the section headings, so I just need to fill it with the numbered list.

I would suggest the following solution.
1) Implement the IReplacingCallback interface.
2) Use the Range.Replace method to find the unique tag.
3) Move the cursor to the text (unique tag) and insert the numbered list.
Please read the "Find and Replace" documentation and use the following code to insert the numbered list at the position of the unique tag.
string[] list = new string[] { "Jonathan Spielberg", "Stephanie Black", "Marcus Smith", "Kylie Ashton" };
Document mainDoc = new Document(MyDir + "in.docx");
mainDoc.Range.Replace(new Regex("######CANDIDATESNAME#####"), new FindandInsertList(list), false);
mainDoc.Save(MyDir + "Out.docx");
//--------------------------------------
public class FindandInsertList : IReplacingCallback
{
private string[] listitems;
public FindandInsertList(string[] list)
{
listitems = list;
}
ReplaceAction IReplacingCallback.Replacing(ReplacingArgs e)
{
// This is a Run node that contains either the beginning or the complete match.
Node currentNode = e.MatchNode;
// The first (and may be the only) run can contain text before the match,
// in this case it is necessary to split the run.
if (e.MatchOffset > 0)
currentNode = SplitRun((Run)currentNode, e.MatchOffset);
// This array is used to store all nodes of the match for further removing.
ArrayList runs = new ArrayList();
// Find all runs that contain parts of the match string.
int remainingLength = e.Match.Value.Length;
while (
(remainingLength > 0) &&
(currentNode != null) &&
(currentNode.GetText().Length <= remainingLength))
{
runs.Add(currentNode);
remainingLength = remainingLength - currentNode.GetText().Length;
// Select the next Run node.
// Have to loop because there could be other nodes such as BookmarkStart etc.
do
{
currentNode = currentNode.NextSibling;
}
while ((currentNode != null) && (currentNode.NodeType != NodeType.Run));
}
// Split the last run that contains the match if there is any text left.
if ((currentNode != null) && (remainingLength > 0))
{
SplitRun((Run)currentNode, remainingLength);
runs.Add(currentNode);
}
// Create a DocumentBuilder and move it to the last run of the match.
DocumentBuilder builder = new DocumentBuilder(e.MatchNode.Document as Document);
builder.MoveTo((Run)runs[runs.Count - 1]);
builder.ListFormat.List = e.MatchNode.Document.Lists.Add(ListTemplate.NumberDefault);
foreach (string item in listitems)
{
builder.Writeln(item);
}
// End the numbered list.
builder.ListFormat.RemoveNumbers();
// Now remove all runs in the sequence.
foreach (Run run in runs)
run.Remove();
// Signal to the replace engine to do nothing because we have already done all what we wanted.
return ReplaceAction.Skip;
}
private static Run SplitRun(Run run, int position)
{
Run afterRun = (Run)run.Clone(true);
afterRun.Text = run.Text.Substring(position);
run.Text = run.Text.Substring(0, position);
run.ParentNode.InsertAfter(afterRun, run);
return afterRun;
}
}
I work with Aspose as a Developer Evangelist.

Looping through XML element to add data

I don't know exactly how to word my question, so apologies up front. I have an XML file with elements like the following:
<Allow_BenGrade>
<Amount BenListID="0">0</Amount>
</Allow_BenGrade>
<Add_Earnings_NonTaxable>
<Amount AddEarnID="0">0</Amount>
</Add_Earnings_NonTaxable>
I am interested in Allow_BenGrade, where I want to add multiple elements. I have a list of 3 items, but when I loop through it to write to the file, only the last item in the list is written, so instead of having 3 elements inside Allow_BenGrade, I end up with one (the last one in the list). My code is below. Please help, thank you.
var query = from nm in xelement.Elements("EmployeeFinance")
select new Allowance {
a_empersonalID = (int)nm.Element("EmpPersonal_Id"),
a_allbengradeID = (int)nm.Element("Grade_Id")
};
var x = query.ToList();
foreach (var xEle in x)
{
var qryBenListGrade = from ee in context.Employee_Employ
join abg in context.All_Inc_Ben_Grade
on ee.Grade_Id equals abg.GradeID
join abl in context.All_Inc_Ben_Listing
on abg.All_Inc_Ben_ListingID equals abl.ID
where ee.Employee_Personal_InfoEmp_id == xEle.a_empersonalID && abg.GradeID == xEle.a_allbengradeID && (abl.Part_of_basic == "N" && abl.Status == "A" && abl.Type_of_earnings == 2)
//abl.Approved_on !=null &&
select new
{
abl.ID,
abl.Amount,
abg.GradeID,
ee.Employee_Personal_InfoEmp_id,
abl.Per_Non_Taxable,
abl.Per_Taxable
};
var y = qryBenListGrade.ToList();
//xEle.a_Amount = 0;
foreach (var tt in y)
{
Debug.WriteLine("amount: " + tt.Amount + " emp id: " + tt.Employee_Personal_InfoEmp_id + " ben list id: " + tt.ID);
// xEle.a_Amount = xEle.a_Amount + tt.Amount;
var result = from element in doc.Descendants("EmployeeFinance")
where int.Parse(element.Element("EmpPersonal_Id").Value) == tt.Employee_Personal_InfoEmp_id
select element;
foreach (var ele in result)
{
ele.Element("Allow_BenGrade").SetElementValue("Amount", tt.Amount);
//ele.Element("Allow_BenGrade").Element("Amount").SetAttributeValue("BenListID", tt.ID);
}
}
doc.Save(GlobalClass.GlobalUrl);
}

SetElementValue will, as the name suggests, set the value of the Amount element... You need to Add a new one instead:
ele.Element("Allow_BenGrade").Add(new XElement("Amount",
new XAttribute("BenListID", tt.ID),
tt.Amount));
Let me know if that solves it for you.

The XElement.SetElementValue Method:
Sets the value of a child element, adds a child element, or removes a
child element.
Also:
The value is assigned to the first child element with the specified
name. If no child element with the specified name exists, a new child
element is added. If the value is null, the first child element with
the specified name, if any, is deleted.
This method does not add child nodes or attributes to the specified
child element.
You should use the XElement.Add Method instead.
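The difference between the two methods can be seen in a small, self-contained sketch. The three sample items and the class and method names are invented for illustration; only the Allow_BenGrade, Amount, and BenListID names come from the question.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class AddAmountDemo
{
    // Builds an <Allow_BenGrade> element from three hypothetical items,
    // either by adding a new <Amount> each time or by calling SetElementValue.
    public static XElement Build(bool useAdd)
    {
        var parent = new XElement("Allow_BenGrade");
        var items = new[] { (Id: 1, Amount: 100m), (Id: 2, Amount: 250m), (Id: 3, Amount: 75m) };

        foreach (var item in items)
        {
            if (useAdd)
            {
                // Appends a new child on every iteration.
                parent.Add(new XElement("Amount",
                    new XAttribute("BenListID", item.Id),
                    item.Amount));
            }
            else
            {
                // Creates the child once, then overwrites its value each time.
                parent.SetElementValue("Amount", item.Amount);
            }
        }
        return parent;
    }

    public static void Main()
    {
        Console.WriteLine(Build(useAdd: false).Elements("Amount").Count()); // 1
        Console.WriteLine(Build(useAdd: true).Elements("Amount").Count());  // 3
    }
}
```

With SetElementValue the single remaining Amount holds the last value written, which matches the behaviour described in the question.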

Iterate all 'select' elements and get all their values in Selenium

I have the following code in C# using selenium:
private void SelectElementFromList(string label)
{
var xpathcount = selenium.GetXpathCount("//select");
for (int i = 1; i <= xpathcount; ++i)
{
string[] options;
try
{
options = selenium.GetSelectOptions("//select["+i+"]");
}
catch
{
continue;
}
foreach (string option in options)
{
if (option == label)
{
selenium.Select("//select[" + i + "]", "label=" + label);
return;
}
}
}
}
The problem is the line:
options = selenium.GetSelectOptions("//select["+i+"]");
When i == 1 this works, but when i > 1 the method returns null ("ERROR: Element //select[2] not found").
I have also tried this code in JS:
var element = document.evaluate("//select[1]/option[1]/@value", document, null, XPathResult.ANY_TYPE, null);
alert(element.iterateNext());
var element = document.evaluate("//select[2]/option[1]/@value", document, null, XPathResult.ANY_TYPE, null);
alert(element.iterateNext());
Which print on the screen "[object Attr]" and then "null".
What am I doing wrong?
My goal is to iterate all "select" elements on the page and find the one with the specified label and select it.

This is the second most frequently asked question about XPath (the first being unprefixed names and the default namespace).
In your code:
options = selenium.GetSelectOptions("//select["+i+"]");
An expression of the type is evaluated:
//select[position() =$someIndex]
which is a synonym for:
//select[$someIndex]
when it is known that $someIndex has an integer value.
However, by definition of the // XPath pseudo-operator,
//select[$k]
when $k is integer, means:
"Select all select elements in the document that are the $k-th select child of their parent."
You wrote: "When i == 1 this works, but when i > 1 the method returns null ('ERROR: Element //select[2] not found'). It works only when i == 1."
This simply means that in the XML document there is no element that has more than one select child.
This is a rule to remember: The [] XPath operator has higher precedence (priority) than the // pseudo-operator.
The solution: As always when we need to override the default precedence of operators, we must use brackets.
Change:
options = selenium.GetSelectOptions("//select["+i+"]");
to:
options = selenium.GetSelectOptions("(//select)["+i+"]");
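The precedence rule is easy to verify outside Selenium. The sketch below evaluates the three expressions with System.Xml against a small made-up document in which each select element sits in its own parent; the document, ids, and class and method names are assumptions for illustration.

```csharp
using System;
using System.Xml;

public static class SelectIndexDemo
{
    // Hypothetical page: two <select> elements, each the only select in its <div>.
    private const string Page =
        "<html><body>" +
        "<div><select id='a'/></div>" +
        "<div><select id='b'/></div>" +
        "</body></html>";

    // Returns the ids of all elements matched by the XPath, joined by commas.
    public static string MatchedIds(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Page);

        var ids = new System.Collections.Generic.List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            ids.Add(node.Attributes["id"].Value);
        }
        return string.Join(",", ids);
    }

    public static void Main()
    {
        // Each select is the first select child of its own parent.
        Console.WriteLine("[" + MatchedIds("//select[1]") + "]");   // [a,b]

        // No parent has a second select child, so nothing matches.
        Console.WriteLine("[" + MatchedIds("//select[2]") + "]");   // []

        // Parentheses build the full node-set first, then index into it.
        Console.WriteLine("[" + MatchedIds("(//select)[2]") + "]"); // [b]
    }
}
```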

Finally I've found a solution.
I've just replaced these lines
options = selenium.GetSelectOptions("//select["+i+"]");
selenium.Select("//select["+i+"]", "label="+label);
with these
options = selenium.GetSelectOptions("//descendant::select[" + i + "]");
selenium.Select("//descendant::select[" + i + "]", "label=" + label);

The above solution, options = selenium.GetSelectOptions("(//select)["+i+"]");, didn't work for me, so I tried CSS selectors instead.
I wanted to get the username and password text boxes. css=input gave me the username text box, and css=input+input gave me the password text box.
CSS selectors like these can be combined in many ways.
I think this will help you achieve your goal.
