HtmlAgilityPack replace node - c#

I want to replace a node with a new node. How can I get the exact position of the node and do a complete replace?
I've tried the following, but I can't figure out how to get the index of the node or which parent node to call ReplaceChild() on.
string html = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
var bolds = document.DocumentNode.Descendants().Where(item => item.Name == "b");
foreach (var item in bolds)
{
string newNodeHtml = GenerateNewNodeHtml();
HtmlNode newNode = new HtmlNode(HtmlNodeType.Text, document, ?);
item.ParentNode.ReplaceChild( )
}

To create a new node, use the HtmlNode.CreateNode() factory method; do not use the constructor directly.
This code should work out for you:
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodeStr = "<foo>bar</foo>";
var newNode = HtmlNode.CreateNode(newNodeStr);
item.ParentNode.ReplaceChild(newNode, item);
}
Note that we need to call ToList() on the query: the loop modifies the document, so enumerating the live query while replacing nodes would fail.
If you wish to replace with this string:
"some text <b>node</b> <strong>another node</strong>"
The problem is that it is no longer a single node but a series of nodes. You can parse it fine using HtmlNode.CreateNode() but in the end, you're only referencing the first node of the sequence. You would need to replace using the parent node.
var htmlStr = "<b>bold_one</b><strong>strong</strong><b>bold_two</b>";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var query = doc.DocumentNode.Descendants("b");
foreach (var item in query.ToList())
{
var newNodesStr = "some text <b>node</b> <strong>another node</strong>";
var newHeadNode = HtmlNode.CreateNode(newNodesStr);
item.ParentNode.ReplaceChild(newHeadNode.ParentNode, item);
}
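A different way to handle the multi-node case, offered here as a sketch rather than as the answer's own method, is to parse the replacement markup into a scratch HtmlDocument, insert its top-level nodes one at a time before the target, and then remove the target. The scratch document and the variable names are illustrative assumptions; the code is written as top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<b>bold_one</b><strong>strong</strong><b>bold_two</b>");

// Snapshot the targets first, because the loop changes the tree.
foreach (var item in doc.DocumentNode.Descendants("b").ToList())
{
    // Parse the replacement markup in a scratch document (an assumption of this sketch).
    var fragment = new HtmlDocument();
    fragment.LoadHtml("some text <b>node</b> <strong>another node</strong>");

    // Insert each top-level node of the fragment just before the node being replaced.
    foreach (var child in fragment.DocumentNode.ChildNodes.ToList())
    {
        item.ParentNode.InsertBefore(child, item);
    }

    // Finally drop the original node.
    item.Remove();
}

Console.WriteLine(doc.DocumentNode.OuterHtml);
```

Because the list of `<b>` nodes is captured before the loop starts, the `<b>` elements that arrive with the replacement markup are not processed again.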

I have implemented the following solution to achieve the same.
var htmlStr = "<b>bold_one</b><div class='LatestLayout'><div class='olddiv'><strong>strong</strong></div></div><b>bold_two</b>";
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlStr);
// newChild is the replacement HtmlNode, created elsewhere (for example with HtmlNode.CreateNode()).
htmlDoc.DocumentNode.SelectSingleNode("//div[@class='olddiv']").Remove();
htmlDoc.DocumentNode.SelectSingleNode("//div[@class='LatestLayout']").PrependChild(newChild);
htmlDoc.Save(FilePath); // FilePath: full path of the .html file, if you need to save the result.
So: select the old node and remove it, then add the new node as a child of the respective parent node.

Related

How to separate strings from a serialized XML node

I have a serialized XML file. This shows the relevant part:
I am reading this XML file with this (code snippet):
temp = Path.GetFileNameWithoutExtension(s);
var document = new XmlDocument();
document.Load(s);
var root = document.DocumentElement;
var node = root["ScenarioDescription"];
var text = node?.InnerText;
var ArmyNode = root["ArmyFiles"];
var ArmyText = ArmyNode?.InnerText;
However, ArmyText returns the concatenation of all three strings that make up the ArmyFiles node. I need them as three separate strings. How can I do this?
This code works to read all the strings in the node and place them into a list:
foreach (XmlElement A in ArmyNode)
{
var ArmyTemp = A.InnerText;
ArmyList.Add(ArmyTemp);
}
var ArmyText = ArmyNode?.InnerText;
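Since the XML from the question is not shown here, the following is only a minimal sketch under an assumed layout: an ArmyFiles element holding three string children. The element name `string`, the root name and the file names are hypothetical. It shows the same idea as the loop above, reading each child element's InnerText instead of the parent's.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical sample document; the real file's layout may differ.
var xml =
    "<Scenario>" +
    "<ScenarioDescription>Demo</ScenarioDescription>" +
    "<ArmyFiles><string>red.army</string><string>blue.army</string><string>green.army</string></ArmyFiles>" +
    "</Scenario>";

var document = new XmlDocument();
document.LoadXml(xml);

var armyNode = document.DocumentElement["ArmyFiles"];
var armyList = new List<string>();

if (armyNode != null)
{
    // Walk the children one by one; InnerText on the parent would concatenate them.
    foreach (XmlNode child in armyNode.ChildNodes)
    {
        if (child is XmlElement element)
        {
            armyList.Add(element.InnerText);
        }
    }
}

Console.WriteLine(string.Join(", ", armyList)); // red.army, blue.army, green.army
```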

Parse HTML class in individual items with htmlagilitypack

I want to parse HTML. I used the following code, but I get everything in one item instead of getting the items individually:
var url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
var web = new HtmlWeb();
var doc = web.Load(url);
IEnumerable<HtmlNode> nodes =
doc.DocumentNode.Descendants()
.Where(n => n.HasClass("search-result"));
foreach (var item in nodes)
{
string itemx = item.SelectSingleNode(".//a").Attributes["href"].Value;
MessageBox.Show(itemx);
MessageBox.Show(item.InnerText);
}
I only receive one message for the first item, and the second message displays all items.
When you search the data from the url based on class 'search-result', there is only one node that is returned. Instead of iterating through its children, you only go through that one div, which is why you are only getting one result.
If you want to get a list of all the links inside the div with class "search-result", then you can do the following.
Code:
string url = "https://subscene.com/subtitles/searchbytitle?query=joker&l=";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
List<string> listOfUrls = new List<string>();
HtmlNode searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");
// Iterate through all the child nodes that have the 'a' tag.
foreach (HtmlNode node in searchResult.SelectNodes(".//a"))
{
string thisUrl = node.GetAttributeValue("href", "");
if (!string.IsNullOrEmpty(thisUrl) && !listOfUrls.Contains(thisUrl))
listOfUrls.Add(thisUrl);
}
What does it do?
SelectSingleNode("//div[@class='search-result']") -> retrieves the div that holds all the search results and ignores the rest of the document.
The loop iterates only over the subnodes that are 'a' tags and adds each href to a list. Subnodes are selected with the dot notation in SelectNodes(".//a") (if you use // instead of .//, it searches the entire page, which is not what you want).
The if statement makes sure it's only adding unique, non-empty values.
You have all the links now.
Fiddle: https://dotnetfiddle.net/j5aQFp
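To try the relative-XPath idea without depending on the live page, here is a small sketch against inline markup. The HTML string is a made-up stand-in for the real search page, so its structure is an assumption. It also guards against SelectNodes returning null, which HtmlAgilityPack does when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical markup standing in for the real page.
var html =
    "<a href='/outside'>outside</a>" +
    "<div class='search-result'>" +
    "<a href='/one'>One</a><a href='/two'>Two</a><a href='/one'>One again</a>" +
    "</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

var urls = new List<string>();
var searchResult = doc.DocumentNode.SelectSingleNode("//div[@class='search-result']");

// ".//a" is relative to searchResult; "//a" would also pick up the link outside the div.
var links = searchResult?.SelectNodes(".//a[@href]");
if (links != null) // SelectNodes yields null, not an empty collection, when nothing matches
{
    foreach (var link in links)
    {
        var href = link.GetAttributeValue("href", "");
        if (!string.IsNullOrEmpty(href) && !urls.Contains(href))
        {
            urls.Add(href);
        }
    }
}

Console.WriteLine(string.Join(", ", urls));
```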
I think it's how you're looking up and storing the data. Try:
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue( "href", string.Empty );
MessageBox.Show(hrefValue);
MessageBox.Show(link.InnerText);
}

Trouble reading iTunes XML feed

I am trying to read an XML feed from http://itunes.apple.com/us/rss/topsongs/limit=10/genre=2/xml.
I want to access the fields like this:
<im:price amount="1.29000" currency="USD">$1.29</im:price>
<im:releaseDate label="December 31, 1960">1960-12-31T16:00:00-07:00</im:releaseDate>
Here is what I have done so far:
var xml = "http://itunes.apple.com/us/rss/topsongs/limit=10/genre=2/xml";
XmlDocument doc = new XmlDocument();
doc.Load(xml);
XmlNodeList items = doc.SelectNodes("//entry");
foreach (var item in items) {
// Do something with item.
}
No luck, though. items is null. Why? What am I doing wrong?
You need to create a namespace manager to map the RSS and also the iTunes custom tags namespace URIs to short prefixes (itunes and im in the example below):
var xml = "http://itunes.apple.com/us/rss/topsongs/limit=10/genre=2/xml";
XmlDocument doc = new XmlDocument();
doc.Load(xml);
var namespaceManager = new XmlNamespaceManager(doc.NameTable);
namespaceManager.AddNamespace("itunes", "http://www.w3.org/2005/Atom");
namespaceManager.AddNamespace("im", "http://itunes.apple.com/rss");
XmlNodeList items = doc.SelectNodes("//itunes:entry", namespaceManager);
foreach (XmlNode item in items)
{
var price = item.SelectSingleNode("im:price", namespaceManager);
var releaseDate = item.SelectSingleNode("im:releaseDate", namespaceManager);
if (price != null)
{
Console.WriteLine(price.Attributes["amount"].InnerText);
}
if (releaseDate != null)
{
Console.WriteLine(releaseDate.Attributes["label"].InnerText);
}
}
For that specific feed you should get 10 entries.
It's in the docs as well:
If the XPath expression does not include a prefix, it is assumed that
the namespace URI is the empty namespace. If your XML includes a
default namespace, you must still use the XmlNamespaceManager and add
a prefix and namespace URI to it; otherwise, you will not get any
nodes selected. For more information, see Select Nodes Using XPath
Navigation.
Alternatively, you can use a namespace-agnostic XPath:
XmlNodeList items = doc.SelectNodes("//*[local-name() = 'entry']");
Finally, I am not sure why you said items is null. It cannot be: SelectNodes returns an empty XmlNodeList when nothing matches, so running your original code should give you a list with zero items.
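To see the namespace effect without hitting the network, here is a small sketch that uses an inline document imitating the feed's structure. The sample XML and the prefix name `atom` are assumptions of this sketch; only the two namespace URIs are taken from the answer above.

```csharp
using System;
using System.Xml;

// Hypothetical miniature of the feed: a default Atom namespace plus the im: prefix.
var xml =
    "<feed xmlns='http://www.w3.org/2005/Atom' xmlns:im='http://itunes.apple.com/rss'>" +
    "<entry><im:price amount='1.29000' currency='USD'>$1.29</im:price></entry>" +
    "<entry><im:price amount='0.99000' currency='USD'>$0.99</im:price></entry>" +
    "</feed>";

var doc = new XmlDocument();
doc.LoadXml(xml);

// Without a prefix the XPath looks in the empty namespace and matches nothing.
XmlNodeList unprefixed = doc.SelectNodes("//entry");
Console.WriteLine(unprefixed.Count); // 0 (an empty list, not null)

var ns = new XmlNamespaceManager(doc.NameTable);
ns.AddNamespace("atom", "http://www.w3.org/2005/Atom");
ns.AddNamespace("im", "http://itunes.apple.com/rss");

XmlNodeList entries = doc.SelectNodes("//atom:entry", ns);
Console.WriteLine(entries.Count); // 2

foreach (XmlNode entry in entries)
{
    var price = entry.SelectSingleNode("im:price", ns);
    Console.WriteLine(price?.Attributes?["amount"]?.Value);
}
```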

HTMLAgility:Replace two nodes with a new element

I am looping through a nodes collection. I have to replace the current node and sibling of the current node with a new element.
I have written the below code for doing that:
private void modifyNodes(IEnumerable<HtmlNode> selectedNodes)
{
foreach (var node in selectedNodes)
{
node.NextSibling.SetAttributeValue("style", "font-weight:bold;padding:2px 2px;");
node.SetAttributeValue("style", "float:right;");
var parentNode = node.ParentNode;
var doc = new HtmlDocument();
var newElement = doc.CreateElement("table");
newElement.SetAttributeValue("style", "background-color:#e4ecf8;width:100%");
var sectionRow = doc.CreateElement("tr");
var headerColumn = doc.CreateElement("td");
headerColumn.AppendChild(node.NextSibling);
var weightColumn = doc.CreateElement("td");
weightColumn.AppendChild(node);
sectionRow.AppendChild(headerColumn);
sectionRow.AppendChild(weightColumn);
newElement.AppendChild(sectionRow);
element.ParentNode.RemoveChild(node);
parentNode.ReplaceChild(newElement, node.NextSibling);
}
}
This adds the new element and removes the passed node, but it fails to remove the next sibling of the node. What am I doing wrong here?
Please help.
You explicitly replaced node.NextSibling with newElement, and as you said, the new element was added. The problem may be the type of the next sibling. Most probably it is a text node (very often the \r\n that separates the HTML nodes).
So it seems that your new node just replaced the text node, and the result is a bit unexpected. If this really is the issue, you could use a workaround like this:
// next sibling
var next = node.NextSibling;
// skip text nodes to reach the first non-text sibling
while (next != null && next is HtmlTextNode)
next = next.NextSibling;
var newNode = doc.CreateElement(...);
// replace the current node with the new one
node.ParentNode.ReplaceChild(newNode, node);
// remove the next node if it was found
if (next != null)
next.Remove();
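The text-node explanation can be checked on a small inline fragment. This is a sketch under assumptions: the markup is invented, and the loop tests NodeType instead of `is HtmlTextNode`, a slight variation that also skips comment nodes.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical fragment: a line break sits between the two elements.
var doc = new HtmlDocument();
doc.LoadHtml("<div><span>weight</span>\r\n<h3>header</h3></div>");

var node = doc.DocumentNode.SelectSingleNode("//span");

// The immediate sibling is the whitespace between the tags, not the <h3>.
Console.WriteLine(node.NextSibling.NodeType);

// Walk forward until an element node is found (or the siblings run out).
var next = node.NextSibling;
while (next != null && next.NodeType != HtmlNodeType.Element)
{
    next = next.NextSibling;
}

Console.WriteLine(next?.Name);
```

If `next` ends up null, the node had no following element sibling, so any code that moves or removes it should check for that first.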

Parsing Field Values from Sharepoint List Services Lists.GetList

I'm trying to write something that quickly will grab field values (e.g. combo box, lookups, etc) using the Sharepoint Web Services. The following code works, but is slow and seems inefficient. Is there any way to turn this into a LINQ style query with XDocument/XElement? When I try to Parse the OuterXml it seems to load incorrectly.
MSDN - Lists.GetList
ProuductionResultNode = listservice.GetList(productiontable_listGUID);
XmlDocument doc = new XmlDocument();
doc.LoadXml(ProuductionResultNode.OuterXml);
XmlNamespaceManager mg = new XmlNamespaceManager(doc.NameTable);
mg.AddNamespace("sp", "http://schemas.microsoft.com/sharepoint/soap/");
mg.AddNamespace("z", "#RowsetSchema");
mg.AddNamespace("rs", "urn:schemas-microsoft-com:rowset");
mg.AddNamespace("y", "http://schemas.microsoft.com/sharepoint/soap/ois");
mg.AddNamespace("w", "http://schemas.microsoft.com/WebPart/v2");
mg.AddNamespace("d", "http://schemas.microsoft.com/sharepoint/soap/directory");
XmlNodeList FieldsInList = doc.SelectNodes("//sp:Field", mg);
foreach (XmlNode Field in FieldsInList)
{
if (Field.HasChildNodes)
{
if (Field.Attributes["Name"].Value == fieldNameInternal)
{
foreach (XmlNode node in Field.ChildNodes)
{
if (node.HasChildNodes)
{
foreach (XmlNode Newnode in node.ChildNodes)
{
if (Newnode.HasChildNodes)
{
ret.Add(Newnode.InnerText);
}
}
}
}
}
}
}
return ret;
An example dropdown Field looks like this:
<Field Type="Choice" DisplayName="Media Type" Required="FALSE" Format="Dropdown" FillInChoice="FALSE" ID="{d814daf1-0bd2-48cc-8709-a513a3de4ef4}" SourceID="{1c01c034-f1fd-447f-8ed4-d60b997d0c3a}" StaticName="Media_x0020_Type" Name="Media_x0020_Type" ColName="nvarchar4" RowOrdinal="0" Version="3"><Default>CD/DVD</Default><CHOICES><CHOICE>CD/DVD</CHOICE><CHOICE>Hard Drive</CHOICE><CHOICE>Flash Drive</CHOICE><CHOICE>Virtual</CHOICE></CHOICES></Field>
The other queries I am using for GetListItems seem to work beautifully.
XElement ziprecords = XElement.Parse(ZipItemsResultNode.OuterXml);
XName name = XName.Get("data", "urn:schemas-microsoft-com:rowset");
var iterationNotes =
from ele in ziprecords.Element(name).Elements()
where ele.Attribute("ows_Title").Value.Contains(fileName)
where ele.Attribute("ows_Iteration_x0020_Notes") != null
select new { iterationNote = ele.Attribute("ows_Iteration_x0020_Notes").Value,
fileName = ele.Attribute("ows_Title").Value };
My Solution
I came up with a solution after some searching. I'm not sure why regular Parse and other methods didn't work, but this seems to fix whatever mistakes I was making. I'll leave this here in case somebody can explain why these steps are needed. I suspect there was an issue with the encoding, which I assumed XDocument etc. would naturally take care of.
ProuductionResultNode = listservice.GetList(productiontable_listGUID);
XmlDocument xdoc = new XmlDocument();
XmlNamespaceManager nsmgr = new XmlNamespaceManager(xdoc.NameTable);
nsmgr.AddNamespace("ans", "http://schemas.microsoft.com/sharepoint/soap/");
byte[] byteArray = Encoding.ASCII.GetBytes(ProuductionResultNode.SelectSingleNode(".//ans:Fields", nsmgr).OuterXml);
MemoryStream stream = new MemoryStream(byteArray);
XElement xe = XElement.Load(stream);
XElement qry =
(from field in xe.Descendants()
where field.Attribute("Name") != null
where field.Attribute("Name").Value == "Ship_x0020_Via"
select field).Single();
List<string> ret = new List<string>();
foreach (XElement xle in qry.XPathSelectElements(".//ans:CHOICES", nsmgr).Elements())
{
ret.Add(xle.Value);
}
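For comparison, here is a minimal LINQ to XML sketch that qualifies element names with an XNamespace instead of going through a byte stream. It assumes the Field elements live in the SharePoint SOAP namespace, as the "//sp:Field" query in the question suggests. The wrapper element and the trimmed attribute list are assumptions; the choices are taken from the example Field above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

XNamespace sp = "http://schemas.microsoft.com/sharepoint/soap/";

// Cut-down stand-in for the Fields part of a GetList response (assumed shape).
var xml =
    "<Fields xmlns='http://schemas.microsoft.com/sharepoint/soap/'>" +
    "<Field Type='Choice' DisplayName='Media Type' Name='Media_x0020_Type'>" +
    "<Default>CD/DVD</Default>" +
    "<CHOICES><CHOICE>CD/DVD</CHOICE><CHOICE>Hard Drive</CHOICE>" +
    "<CHOICE>Flash Drive</CHOICE><CHOICE>Virtual</CHOICE></CHOICES>" +
    "</Field></Fields>";

var fields = XElement.Parse(xml);

// Element names must carry the namespace; attribute names here are unqualified.
var choices = fields
    .Descendants(sp + "Field")
    .Where(f => (string)f.Attribute("Name") == "Media_x0020_Type")
    .Descendants(sp + "CHOICE")
    .Select(c => c.Value)
    .ToList();

Console.WriteLine(string.Join(" | ", choices)); // CD/DVD | Hard Drive | Flash Drive | Virtual
```

A query such as `fields.Descendants("Field")` without the namespace would return nothing for this input, which may be why the parsed document looked wrong in the question; that is a guess, not something the source confirms.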
