Injecting HTML at specific location using HTMLAgilityPack - c#

I've been asked to inject a block of HTML at a specific point in an HTML document, and have been looking at using HTMLAgilityPack to do so.
The recommended way to do this, as far as I can tell, is to parse the document into nodes and replace or delete the relevant ones.
This is my code so far:
//Load original HTML
var originalHtml = new HtmlDocument();
originalHtml.Load(@"C:\Temp\test.html");
//Load inject HTML
var inject = new HtmlDocument();
inject.Load(@"C:\Temp\Temp\inject.html");
var injectNode = HtmlNode.CreateNode(inject.Text);
//Get all HTML nodes to inject/delete
var nodesToDelete = originalHtml.DocumentNode.SelectNodes("//p[@style='page-break-after:avoid']");
var countToDelete = nodesToDelete.Count();
//loop through stuff to remove
int count = 0;
foreach (var nodeToDelete in nodesToDelete)
{
    count++;
    if (count == 1)
    {
        //replace with inject HTML
        nodeToDelete.ParentNode.ReplaceChild(injectNode, nodeToDelete);
    }
    else if (count <= countToDelete)
    {
        //remove, as HTML already injected
        nodeToDelete.ParentNode.RemoveChild(nodeToDelete);
    }
}
What I'm finding is that the original HTML is not updated correctly: it appears to inject only the top-level node of the injected HTML, which is a simple element, and none of its child nodes.
Any help??
Thanks,
Patrick.

Well, I couldn't work out how to do this using HTMLAgilityPack, probably because of my limited understanding of nodes, but I did find an easy fix using AngleSharp.
//Load original HTML into document
var parser = new HtmlParser();
var htmlDocument = parser.Parse(File.ReadAllText(@"C:\Temp\test.html"));
//Load inject HTML as raw text
var injectHtml = File.ReadAllText(@"C:\Temp\inject.html");
//Get all HTML elements to inject/delete
//(ToList takes a snapshot, so the document can be modified while looping)
var elements = htmlDocument.All.Where(e => e.Attributes.Any(a => a.Name == "style" && a.Value == "page-break-after:avoid")).ToList();
//loop through stuff to remove
int count = 1;
foreach (var element in elements)
{
    if (count == 1)
    {
        //replace with inject HTML
        element.OuterHtml = injectHtml;
    }
    else
    {
        //remove, as HTML already injected
        element.Remove();
    }
    count++;
}
//Re-write updated file
File.WriteAllText(@"C:\Temp\test_updated.html", string.Format("{0}{1}{2}{3}","<html>",htmlDocument.Head.OuterHtml,htmlDocument.Body.OuterHtml,"</html>"));
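For completeness, the same replacement can probably be done with HTMLAgilityPack itself. HtmlNode.CreateNode builds a single node (older releases return only the first node of the parsed fragment, and I believe newer ones throw when the fragment has several top-level nodes), so a fragment with several top-level nodes has to be inserted node by node. The following is only a minimal sketch under that assumption: the names HtmlInjector and Inject are made up for illustration, and it works on strings rather than the files used above.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlInjector
{
    // Replaces the first node matched by xpath with every top-level node of
    // injectHtml, and removes the remaining matches.
    public static string Inject(string originalHtml, string injectHtml, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(originalHtml);

        // Parse the fragment into its own document so all top-level nodes are kept.
        var fragment = new HtmlDocument();
        fragment.LoadHtml(injectHtml);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var matches = doc.DocumentNode.SelectNodes(xpath);
        if (matches == null)
        {
            return doc.DocumentNode.OuterHtml;
        }

        // Snapshot the matches before changing the tree.
        var targets = matches.ToList();
        var first = targets[0];

        // Insert deep copies of the fragment's nodes, in order, before the first match.
        foreach (var child in fragment.DocumentNode.ChildNodes.ToList())
        {
            first.ParentNode.InsertBefore(child.CloneNode(true), first);
        }

        // Remove every matched node, including the first one.
        foreach (var target in targets)
        {
            target.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

With the files from the question, this would be called as HtmlInjector.Inject(File.ReadAllText(...), File.ReadAllText(...), "//p[@style='page-break-after:avoid']") and the result written back with File.WriteAllText.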

Related

Get only the text of a webpage using HTML Agility Pack?

I'm trying to scrape a web page to get just the text. I'm putting each word into a dictionary and counting how many times each word appears on the page. I'm trying to use HTML Agility Pack as suggested from this post: How to get number of words on a web page?
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
int wordCount = 0;
Dictionary<string, int> dict = new Dictionary<string, int>();
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    MatchCollection matches = Regex.Matches(node.InnerText, @"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase);
    foreach (Match s in matches)
    {
        //Add the entry to the dictionary
    }
}
However, with my current implementation, I'm still getting lots of results that are from the markup that should not be counted. It's close, but not quite there yet (I don't expect it to be perfect).
I'm using this page as an example. My results are showing a lot of the uses of the words "width" and "googletag", despite those not being in the actual text of the page at all.
Any suggestions on how to fix this? Thanks!
You can't be sure whether a given word is actually displayed to the user, because JavaScript execution and CSS rules affect what is rendered.
The following program finds 0 matches for "width" and "googletag", but finds 126 matches for "html", whereas Chrome's Ctrl+F finds 106.
Note that the program does not count a text node if its parent node is <script>.
using HtmlAgilityPack;
using System;
namespace WordCounter
{
    class Program
    {
        private static readonly Uri Uri = new Uri("https://www.w3schools.com/html/html_editors.asp");
        static void Main(string[] args)
        {
            var doc = new HtmlWeb().Load(Uri);
            var nodes = doc.DocumentNode.SelectSingleNode("//body").DescendantsAndSelf();
            var word = Console.ReadLine().ToLower();
            while (word != "exit")
            {
                var count = 0;
                foreach (var node in nodes)
                {
                    if (node.NodeType == HtmlNodeType.Text && node.ParentNode.Name != "script" && node.InnerText.ToLower().Contains(word))
                    {
                        count++;
                    }
                }
                Console.WriteLine($"{word} is displayed {count} times.");
                word = Console.ReadLine().ToLower();
            }
        }
    }
}
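The program above counts text nodes that contain one word, whereas the question asks for a dictionary of all words. As a rough sketch of that last step, the text of the nodes that pass the filter could be tallied as follows. It uses only the base class library and the regex from the question; the class name WordTally and its Count method are hypothetical, and it assumes the caller has already collected the InnerText of the text nodes it wants counted.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordTally
{
    // Same pattern as in the question: words of two or more letters, plus "a" and "i".
    private static readonly Regex WordPattern =
        new Regex(@"\b(?:[a-z]{2,}|[ai])\b", RegexOptions.IgnoreCase);

    // Counts how often each word occurs across all given text fragments.
    // Keys are compared case-insensitively, so "Cat" and "cat" share one entry.
    public static Dictionary<string, int> Count(IEnumerable<string> texts)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (var text in texts)
        {
            foreach (Match match in WordPattern.Matches(text))
            {
                // TryGetValue leaves current at 0 when the word is not yet present.
                counts.TryGetValue(match.Value, out int current);
                counts[match.Value] = current + 1;
            }
        }
        return counts;
    }
}
```

In the program above, the fragments would be the InnerText values of the nodes that pass the NodeType and ParentNode checks; excluding a parent named "style" in the same way may remove further markup noise.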

Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
    //Find college
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(catalog);
    var root = doc.DocumentNode;
    var htmlNodes = root.DescendantsAndSelf();
    // Search through fetched html nodes for relevant information
    int counter = 0;
    foreach (HtmlNode node in htmlNodes) {
        string linkName = node.InnerText;
        if (linkName == colleges[college] && counter == 0)
        {
            counter++;
            continue;
        }
        else if(linkName == colleges[college] && counter == 1)
        {
            string targetURL = node.Attributes["href"].Value; //"found it!"; //
            return targetURL;
        }/* */
    }
    return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210
I find XPath easier than writing a lot of code. (The NullReferenceException most likely occurs because DescendantsAndSelf() also returns text nodes and other non-link nodes whose InnerText matches; those have no href attribute, so Attributes["href"] is null.)
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
    .Attributes["href"].Value;
If you have two links with the same text, select the second one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
    .Attributes["href"].Value;
The LINQ version:
var links = doc.DocumentNode.Descendants("a")
    .Where(a => a.InnerText == "College of Science")
    .Select(a => a.Attributes["href"].Value)
    .ToList();

How To change XML namespace of certain element

I have a set of XML files generated by XML serialization of some WCF messages.
Now I want to write a generic method to which I pass an XML filename and a prefix such as mailxml12.
In the XML file, every element whose name has no namespace prefix should then get the prefix mailxml12:.
For example, the source file is:
<DeliveryApptCreateRequest d2p1:ApptType="Pallet" d2p1:PickupOrDelivery="Delivery" d2p1:ShipperApptRequestID="4490660303D5" d2p1:SchedulerCRID="234234" xmlns:d2p1="http://idealliance.org/Specs/mailxml12.0a/mailxml_defs" xmlns="http://idealliance.org/Specs/mailxml12.0a/mailxml_tm">
<SubmittingParty d2p1:MailerID6="123446" d2p1:CRID="342343" d2p1:MaildatUserLicense="A123" />
<SubmittingSoftware d2p1:SoftwareName="asds" d2p1:Vendor="123" d2p1:Version="12" />
<SubmitterTrackingID>2CAD3F71B4405EB16392</SubmitterTrackingID>
<DestinationEntry>No</DestinationEntry>
<OneTimeAppt>
<PreferredAppt>2012-06-29T09:00:00Z</PreferredAppt>
</OneTimeAppt>
<TrailerInfo>
<Trailer>
<TrailerNumber>A</TrailerNumber>
<TrailerLength>20ft</TrailerLength>
</Trailer>
<Carrier>
<CarrierName>N/A</CarrierName>
<URL>http://test.com</URL>
</Carrier>
<BillOfLadingNumber>N/A</BillOfLadingNumber>
</TrailerInfo>
</DeliveryApptCreateRequest>
After running the method, every element name that has no prefix should be prefixed with mailxml:.
For example, DeliveryApptCreateRequest should become mailxml:DeliveryApptCreateRequest,
while an element like d2p1:CompanyName should remain as it is.
I have tried the following code:
private void RepalceFile(string xmlfile)
{
    XmlDocument doc = new XmlDocument();
    doc.Load(xmlfile);
    var a = doc.CreateAttribute("xmlns:mailxml12tm");
    a.Value = "http://idealliance.org/Specs/mailxml12.0a/mailxml_tm";
    doc.DocumentElement.Attributes.Append(a);
    doc.DocumentElement.Prefix = "mailxml12tm";
    foreach (XmlNode item in doc.SelectNodes("//*"))
    {
        if (item.Prefix.Length == 0)
            item.Prefix = "mailxml12tm";
    }
    doc.Save(xmlfile);
}
The only problem is that the root element remains unchanged, while all the other elements are changed as I need.
You can just parse the whole XML as a string and insert namespaces where appropriate. This solution, however, can create lots of new strings only used within the algorithm, which is not good for the performance. However, I've written a function parsing it in this manner and it seems to run quite fast for sample XML you've posted ;). I can post it if you would like to use it.
Another solution is loading XML as XmlDocument and taking advantage of the fact it's a tree-like structure. This way, you can create a method recursively adding appropriate namespaces where appropriate.
Unfortunately, the XmlNode.Name property is read-only, which is why you have to manually copy the entire structure of the XML to change the names of some nodes.
I don't have time to write the code right now, so I just let you write it. If you encounter any issues with it, just let me know.
Update
I've tested your code and code suggested by Jeff Mercado and both of them seem to work correctly, at least in the sample XML you've posted in the question. Make sure the XML you are trying to parse is the same as the one you've posted.
Just to make it work and solve adding namespace issue originally asked, you can use the code, which handles the whole XML as a String and parses it manually:
private static String UpdateNodesWithDefaultNamespace(String xml, String defaultNamespace)
{
    if (!String.IsNullOrEmpty(xml) && !String.IsNullOrEmpty(defaultNamespace))
    {
        int currentIndex = 0;
        while (currentIndex != -1)
        {
            //find index of tag opening character
            int tagOpenIndex = xml.IndexOf('<', currentIndex);
            //no more tag openings are found
            if (tagOpenIndex == -1)
            {
                break;
            }
            //if it's a closing tag
            if (xml[tagOpenIndex + 1] == '/')
            {
                currentIndex = tagOpenIndex + 1;
            }
            else
            {
                currentIndex = tagOpenIndex;
            }
            //find corresponding tag closing character
            int tagCloseIndex = xml.IndexOf('>', tagOpenIndex);
            if (tagCloseIndex <= tagOpenIndex)
            {
                throw new Exception("Invalid XML file.");
            }
            //look for a colon within currently processed tag
            String currentTagSubstring = xml.Substring(tagOpenIndex, tagCloseIndex - tagOpenIndex);
            int firstSpaceIndex = currentTagSubstring.IndexOf(' ');
            int nameSpaceColonIndex;
            //if space was found
            if (firstSpaceIndex != -1)
            {
                //look for namespace colon between tag open character and the first space character
                nameSpaceColonIndex = currentTagSubstring.IndexOf(':', 0, firstSpaceIndex);
            }
            else
            {
                //look for namespace colon between tag open character and tag close character
                nameSpaceColonIndex = currentTagSubstring.IndexOf(':');
            }
            //if there is no namespace
            if (nameSpaceColonIndex == -1)
            {
                //insert namespace after tag opening characters '<' or '</'
                xml = xml.Insert(currentIndex + 1, String.Format("{0}:", defaultNamespace));
            }
            //look for next tags after current tag closing character
            currentIndex = tagCloseIndex;
        }
    }
    return xml;
}
You can use this code to get your app working; however, I strongly encourage you to determine why the other suggested solutions didn't work.
Since in this case you have a default namespace defined, you could just remove the default namespace declaration and add a new declaration for your new prefix using the old namespace name, effectively replacing it.
var prefix = "mailxml";
var content = XElement.Parse(xmlStr);
var defns = content.GetDefaultNamespace();
content.Attribute("xmlns").Remove();
content.Add(new XAttribute(XNamespace.Xmlns + prefix, defns.NamespaceName));
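To make the snippet above easier to try out, here is a small self-contained sketch of the same idea wrapped in a method. The class and method names (PrefixSwap, UsePrefix) are made up for illustration, and the early return when no default namespace is declared is my own assumption about how that case should be handled.

```csharp
using System.Xml.Linq;

public static class PrefixSwap
{
    // Replaces the default namespace declaration on the root element with a
    // declaration that binds the same namespace name to the given prefix.
    public static string UsePrefix(string xml, string prefix)
    {
        XElement root = XElement.Parse(xml);
        XNamespace defaultNs = root.GetDefaultNamespace();

        // The default namespace declaration is the attribute named "xmlns".
        XAttribute declaration = root.Attribute("xmlns");
        if (declaration == null)
        {
            // Assumption: without a default namespace there is nothing to replace.
            return root.ToString();
        }

        declaration.Remove();
        root.Add(new XAttribute(XNamespace.Xmlns + prefix, defaultNs.NamespaceName));

        // The element names keep their namespace; only the declaration changed,
        // so serialization now writes them with the prefix.
        return root.ToString();
    }
}
```

For the sample document in the question, calling PrefixSwap.UsePrefix(xmlStr, "mailxml") should leave the d2p1 attributes untouched, since their namespace declaration is not modified.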
@JeffMercado's solution didn't work for me (probably since I didn't have a default namespace).
I ended up using:
XNamespace ns = Constants.Namespace;
el.Name = (ns + el.Name.LocalName) as XName;
To change the namespace of a whole document I used:
private void rewriteNamespace(XElement el)
{
    // Change namespace
    XNamespace ns = Constants.Namespace;
    el.Name = (ns + el.Name.LocalName) as XName;
    if (!el.HasElements)
        return;
    foreach (XElement d in el.Elements())
        rewriteNamespace(d);
}
Usage:
var doc = XDocument.Parse(xmlStr);
rewriteNamespace(doc.Root);
HTH

Counting the number of elements in an XML document

I am wondering if it is possible to count the number of elements in an XML document, preferably being able to filter using something similar to where (string)query.Attribute("attName") == att.
To the best of my knowledge, I have tried the following, but unfortunately I can't seem to make it work.
var listElements = reader.Elements("shortlist");
foreach (var element in listElements)
{
    XElement _xml;
    location.Position = 0;
    System.IO.StreamReader file = new System.IO.StreamReader(location);
    _xml = XElement.Parse(file.ReadToEnd());
    XAttribute attName = _xml.Attribute("attN");
    if (attName.Value == att)
    {
        Count++;
    }
}
Thanks!
Given that doc is an instance of XDocument
doc.Root.Descendants().Count(d => (string)d.Attribute("attName") == "value");
That would probably be a good application for XPath.
http://support.microsoft.com/kb/308333/en-us
An XPath expression could be "count(//*[@attName='attValue'])".
XmlDocument x = new XmlDocument();
x.Load("data.xml");
//Selects any element occurring anywhere in the document with attribute attName='attValue'
XmlNodeList n = x.SelectNodes("//*[@attName='attValue']");
int tadaa = n.Count;
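The two answers can also be put side by side in one small sketch that works on an XDocument. The class and method names are hypothetical, and the attribute name and value are the placeholders used above. Note that DescendantsAndSelf() is used here so that the root element is included, which matches what //* selects; the Descendants() call in the earlier answer skips the root.

```csharp
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class AttributeCounter
{
    // LINQ to XML: count elements whose attribute attName equals attValue.
    public static int CountWithLinq(string xml, string attName, string attValue)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Root.DescendantsAndSelf()
            .Count(e => (string)e.Attribute(attName) == attValue);
    }

    // XPath: count(...) evaluates to an XPath number, which .NET returns as a double.
    public static int CountWithXPath(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        var result = (double)doc.XPathEvaluate("count(//*[@attName='attValue'])");
        return (int)result;
    }
}
```

The cast of a missing attribute to string yields null rather than throwing, which is why the LINQ version needs no explicit null check.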

Selecting attribute values with html Agility Pack

I'm trying to retrieve a specific image from an HTML document, using Html Agility Pack and this XPath:
//div[@id='topslot']/a/img/@src
As far as I can see, it finds the src attribute, but it returns the img tag. Why is that?
I would expect the InnerHtml/InnerText or something to be set, but both are empty strings. OuterHtml is set to the complete img tag.
Is there any documentation for Html Agility Pack?
You can directly grab the attribute if you use the HtmlNodeNavigator instead.
//Load document from some html string
HtmlDocument hdoc = new HtmlDocument();
hdoc.LoadHtml(htmlContent);
//Load navigator for current document
HtmlNodeNavigator navigator = (HtmlNodeNavigator)hdoc.CreateNavigator();
//Get value from given xpath
string xpath = "//div[@id='topslot']/a/img/@src";
string val = navigator.SelectSingleNode(xpath).Value;
Html Agility Pack does not support attribute selection.
You may use the method "GetAttributeValue".
Example:
//[...] code before needs to load a html document
HtmlAgilityPack.HtmlDocument htmldoc = e.Document;
//get all nodes "a" matching the XPath expression
HtmlNodeCollection AllNodes = htmldoc.DocumentNode.SelectNodes("*[@class='item']/p/a");
//show a messagebox for each node found that shows the content of attribute "href"
foreach (var MensaNode in AllNodes)
{
    string url = MensaNode.GetAttributeValue("href", "not found");
    MessageBox.Show(url);
}
Html Agility Pack will support it soon.
http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=204342
Reading and Writing Attributes with Html Agility Pack
You can both read and set attributes in HtmlAgilityPack. This example selects the <html> tag, checks whether the 'lang' (language) attribute exists, and then reads and writes it.
In the example below, this.All in doc.LoadHtml(this.All) is a string representation of an HTML document.
Read and write:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
for (int i = 0; i < nodes.Count; i++)
{
    if (nodes[i] != null && nodes[i].Attributes.Count > 0 && nodes[i].Attributes.Contains("lang"))
    {
        language = nodes[i].Attributes["lang"].Value; //Get attribute
        nodes[i].Attributes["lang"].Value = "en-US"; //Set attribute
    }
}
Read only:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(this.All);
string language = string.Empty;
var nodes = doc.DocumentNode.SelectNodes("//html");
foreach (HtmlNode a in nodes)
{
    if (a != null && a.Attributes.Count > 0 && a.Attributes.Contains("lang"))
    {
        language = a.Attributes["lang"].Value;
    }
}
I used the following to obtain the src attribute of an image.
var MainImageString = MainImageNode.Attributes.FirstOrDefault(i => i.Name == "src")?.Value;
You can specify the attribute name to get its value; if you don't know the attribute name, set a breakpoint after you have fetched the node and inspect its attributes by hovering over it.
Hope I helped.
I just faced this problem and solved it using GetAttributeValue method.
//Selecting all tbody elements
IList<HtmlNode> nodes = doc.QuerySelectorAll("div.characterbox-main")[1]
    .QuerySelectorAll("div table tbody");
//Iterating over them and getting the src attribute value of img elements.
var data = nodes.Select((node) =>
{
    return new
    {
        name = node.QuerySelector("tr:nth-child(2) th a").InnerText,
        imageUrl = node.QuerySelector("tr td div a img")
            .GetAttributeValue("src", "default-url")
    };
});
