Strip HTML comments from parsed HTML text using AngleSharp - C#

I use the code below to strip specific HTML tags from parsed HTML using AngleSharp (a parser is preferable to regular expressions for such jobs, and AngleSharp is currently maintained while HtmlAgilityPack is not, hence I have been moving to the former).
It works great, but now I want to remove HTML comments as well, meaning whatever is found between <!-- and --> markers.
How would this be achieved using AngleSharp? Using QuerySelector does not seem suitable here, since comments are not elements and cannot be matched by a CSS selector.
private string ExtractContentFromHtml(string input)
{
    List<string> tagsToRemove = new List<string>
    {
        "script",
        "style",
        "img"
    };
    var config = Configuration.Default.WithJavaScript();
    HtmlParser hp = new HtmlParser(config);
    List<IElement> tags = new List<IElement>();
    var hpResult = hp.Parse(input);
    try
    {
        foreach (var tagToRemove in tagsToRemove)
            tags.AddRange(hpResult.QuerySelectorAll(tagToRemove));
        foreach (var tag in tags)
            tag.Remove();
    }
    catch (Exception ex)
    {
        _errors.Add(string.Format("Error in cleaning html. {0}", ex.Message));
    }
    var content = hpResult.QuerySelector("body");
    return content.InnerHtml;
}

After playing with the code above and AngleSharp's API, I came up with the working solution below.
Initially I thought I could drop all the tag-removing code and rely solely on walking text nodes, but that is not advisable:
some text nodes are generated on the fly by JavaScript code, so the script elements need to be removed anyway. I left the style and img removals in as well.
Worth mentioning too that the DOM classifies nodes by type, and comments can be found by looking for nodes of type 8 (NodeType.Comment).
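The full method below sidesteps comments entirely by collecting only text nodes, but if you want to delete the comment nodes themselves, here is a minimal sketch that would slot into the same method; I am assuming the same AngleSharp 0.9.x tree-walker API used below also accepts FilterSettings.Comment (the DOM's SHOW_COMMENT flag):
// Sketch: walk the tree for comment nodes (NodeType.Comment == 8) and detach them.
var commentWalker = hpResult.CreateTreeWalker(hpResult, FilterSettings.Comment);
var comments = new List<INode>();
var commentNode = commentWalker.ToNext();
while (commentNode != null)
{
    comments.Add(commentNode);
    commentNode = commentWalker.ToNext();
}
foreach (var comment in comments)
    comment.Parent.RemoveChild(comment);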
private string ExtractContentFromHtml(string input)
{
    List<string> tagsToRemove = new List<string>
    {
        "script",
        "style",
        "img"
    };
    var config = Configuration.Default.WithJavaScript();
    HtmlParser hp = new HtmlParser(config);
    List<IElement> tags = new List<IElement>();
    var hpResult = hp.Parse(input);
    List<string> textNodesValues = new List<string>();
    try
    {
        foreach (var tagToRemove in tagsToRemove)
            tags.AddRange(hpResult.QuerySelectorAll(tagToRemove));
        foreach (var tag in tags)
            tag.Remove();
        /*
        the following will not work: Document.All enumerates elements only,
        so no node of type Text is ever returned
        textNodesValues = hpResult.All.Where(n => n.NodeType == NodeType.Text).Select(n => n.TextContent).ToList();
        */
        // Walk the whole tree and collect every text node, however deeply nested.
        var treeWalker = hpResult.CreateTreeWalker(hpResult, FilterSettings.Text);
        var textNode = treeWalker.ToNext();
        while (textNode != null)
        {
            textNodesValues.Add(textNode.TextContent);
            textNode = treeWalker.ToNext();
        }
    }
    catch (Exception ex)
    {
        _errors.Add(string.Format("Error in cleaning html. {0}", ex.Message));
    }
    return string.Join(" ", textNodesValues);
}

Related

Aspose.Words ImportNode ignores font formatting when appending child

I am currently using Aspose.Words to open a document, pull content between a bookmark start and a bookmark end, and then place that content into another document. The issue I'm having is that the ImportNode method imports into my document but changes all of the fonts from Calibri to Times New Roman and changes the font size from whatever it was in the original document to 12pt.
The way I'm obtaining the content from the bookmark is by using the Aspose ExtractContent method.
Because ImportNode strips my font formatting, I tried making some adjustments and saving each node to an HTML string using ToString(HtmlSaveOptions). This mostly works, but it strips out the returns in the Word document, so none of my text has the appropriate spacing. My returns end up coming through as HTML in the following format:
"<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\">&#xa0;</span></p>"
When using
DocumentBuilder.InsertHtml("<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>");
it does not correctly add the return on the word document.
Here is the code I'm using; please forgive the comments etc., these have been my attempts at correcting this.
public async Task<string> GenerateHtmlString(Document srcDoc, ArrayList nodes)
{
    // Create a blank document.
    Document dstDoc = new Document();
    ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Open"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
    // Remove the first paragraph from the empty document.
    dstDoc.FirstSection.Body.RemoveAllChildren();
    // Create a new Builder for the temporary document that gets generated with the header or footer data.
    // This allows us to control font and styles separately from the main document being built.
    var newBuilder = new DocumentBuilder(dstDoc);
    Aspose.Words.Saving.HtmlSaveOptions htmlSaveOptions = new Aspose.Words.Saving.HtmlSaveOptions();
    htmlSaveOptions.ExportImagesAsBase64 = true;
    htmlSaveOptions.SaveFormat = SaveFormat.Html;
    htmlSaveOptions.ExportFontsAsBase64 = true;
    htmlSaveOptions.ExportFontResources = true;
    htmlSaveOptions.ExportTextBoxAsSvg = true;
    htmlSaveOptions.ExportRoundtripInformation = true;
    htmlSaveOptions.Encoding = Encoding.UTF8;
    // Obtain all the links from the source document.
    // This is used later to add hyperlinks to the html
    // because by default extracting nodes using Aspose
    // does not pull in the links in a usable way.
    var srcDocLinks = srcDoc.Range.Fields.GroupBy(x => x.DisplayResult).Select(x => x.First()).Where(x => x.Type == Aspose.Words.Fields.FieldType.FieldHyperlink).Distinct().ToList();
    var childNodes = nodes.Cast<Node>().Select(x => x).ToList();
    var oldBuilder = new DocumentBuilder(srcDoc);
    oldBuilder.MoveToBookmark("Header");
    var allchildren = oldBuilder.CurrentParagraph.Runs;
    var allChildNodes = childNodes[0].Document.GetChildNodes(NodeType.Any, true);
    var headerText = allChildNodes[0].Range.Bookmarks["Header"].BookmarkStart.GetText();
    foreach (Node node in nodes)
    {
        var html = node.ToString(htmlSaveOptions);
        try
        {
            // &#xa0; is used by Aspose because it works in XML.
            // If we see this entity and the text of the node is \r we need to insert a break.
            if (html.Contains("&#xa0;") && node.Range.Text == "\r")
            {
                newBuilder.InsertHtml(html, false);
                // Change the node into an HTML string
                /*var htmlString = node.ToString(SaveFormat.Html);
                var tempHtmlLinkDoc = new HtmlDocument();
                tempHtmlLinkDoc.LoadHtml(htmlString);
                // Get all the child nodes of the html document
                var allChildNodes = tempHtmlLinkDoc.DocumentNode.SelectNodes("//*");
                // Loop over all child nodes so we can make sure we apply the correct font family and size to the break.
                foreach (var childNode in allChildNodes)
                {
                    // Get the style attribute from the child node
                    var childNodeStyles = childNode.GetAttributeValue("style", "").Split(';');
                    foreach (var childNodeStyle in childNodeStyles)
                    {
                        // Apply the font name and size to the new builder on the document.
                        if (childNodeStyle.ToLower().Contains("font-family"))
                        {
                            newBuilder.Font.Name = childNodeStyle.Split(':')[1].Trim();
                        }
                        if (childNodeStyle.ToLower().Contains("font-size"))
                        {
                            newBuilder.Font.Size = Convert.ToDouble(childNodeStyle.Split(':')[1]
                                .Replace("pt", "")
                                .Replace("px", "")
                                .Replace("em", "")
                                .Replace("rem", "")
                                .Replace("%", "")
                                .Trim());
                        }
                    }
                }
                // Insert the break with the corresponding font size and name.
                newBuilder.InsertBreak(BreakType.ParagraphBreak);*/
            }
            else
            {
                // Loop through the source document links so the link can be applied to the HTML.
                foreach (var srcDocLink in srcDocLinks)
                {
                    if (html.Contains(srcDocLink.DisplayResult))
                    {
                        // Now that we know the html string has one of the links in it we need to get the address from the node.
                        var linkAddress = srcDocLink.Start.NextSibling.GetText().Replace(" HYPERLINK \"", "").Replace("\"", "");
                        // Convert the node into an HTML string so we can get the correct font color, name, size, and any text decoration.
                        var htmlString = srcDocLink.Start.NextSibling.ToString(SaveFormat.Html);
                        var tempHtmlLinkDoc = new HtmlDocument();
                        tempHtmlLinkDoc.LoadHtml(htmlString);
                        var linkStyles = tempHtmlLinkDoc.DocumentNode.ChildNodes[0].GetAttributeValue("style", "").Split(';');
                        var linkStyleHtml = "";
                        foreach (var linkStyle in linkStyles)
                        {
                            if (linkStyle.ToLower().Contains("color"))
                            {
                                linkStyleHtml += $"color:{linkStyle.Split(':')[1].Trim()};";
                            }
                            if (linkStyle.ToLower().Contains("font-family"))
                            {
                                linkStyleHtml += $"font-family:{linkStyle.Split(':')[1].Trim()};";
                            }
                            if (linkStyle.ToLower().Contains("font-size"))
                            {
                                linkStyleHtml += $"font-size:{linkStyle.Split(':')[1].Trim()};";
                            }
                            if (linkStyle.ToLower().Contains("text-decoration"))
                            {
                                linkStyleHtml += $"text-decoration:{linkStyle.Split(':')[1].Trim()};";
                            }
                        }
                        if (linkAddress.ToLower().Contains("mailto:"))
                        {
                            // Since the link has mailto included don't add the target attribute to the link.
                            html = new Regex($@"\b{srcDocLink.DisplayResult}\b").Replace(html, $"<a href=\"{linkAddress}\" style=\"{linkStyleHtml}\">{srcDocLink.DisplayResult}</a>");
                            //html = html.Replace(srcDocLink.DisplayResult, $"<a href=\"{linkAddress}\" style=\"{linkStyleHtml}\">{srcDocLink.DisplayResult}</a>");
                        }
                        else
                        {
                            // Since the link is not an email include the target attribute.
                            html = new Regex($@"\b{srcDocLink.DisplayResult}\b").Replace(html, $"<a href=\"{linkAddress}\" target=\"_blank\" style=\"{linkStyleHtml}\">{srcDocLink.DisplayResult}</a>");
                            //html = html.Replace(srcDocLink.DisplayResult, $"<a href=\"{linkAddress}\" target=\"_blank\" style=\"{linkStyleHtml}\">{srcDocLink.DisplayResult}</a>");
                        }
                    }
                }
                // Insert the HTML string into the temporary document.
                newBuilder.InsertHtml(html, false);
            }
        }
        catch (Exception)
        {
            throw;
        }
    }
    // This is just for debugging/troubleshooting purposes and to make sure things look correct.
    string tempDocxPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.docx");
    dstDoc.Save(tempDocxPath);
    // We generate this HTML file then load it back up and pass the DocumentNode.OuterHtml back to the requesting method.
    ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Save"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
    string tempHtmlPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.html");
    dstDoc.Save(tempHtmlPath, htmlSaveOptions);
    var tempHtmlDoc = new HtmlDocument();
    tempHtmlDoc.Load(tempHtmlPath);
    var htmlText = tempHtmlDoc.DocumentNode.OuterHtml;
    // Clean up our mess...
    if (File.Exists(tempDocxPath))
    {
        File.Delete(tempDocxPath);
    }
    if (File.Exists(tempHtmlPath))
    {
        File.Delete(tempHtmlPath);
    }
    // Return the generated HTML string.
    return htmlText;
}
Saving each node to HTML and then inserting it into the destination document is not a good idea: not all nodes can be properly saved to HTML, and some formatting can be lost in the Aspose.Words DOM -> HTML -> Aspose.Words DOM roundtrip.
Regarding the original issue, the problem likely occurs because you are using ImportFormatMode.UseDestinationStyles; in this case the styles and defaults of the destination document are used, so the font might change. If you need to keep the source document's formatting, use ImportFormatMode.KeepSourceFormatting.
If the problem occurs even with ImportFormatMode.KeepSourceFormatting, it must be a bug and you should report it to the Aspose.Words staff in the support forum.
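For reference, a minimal sketch of the direct import with source formatting preserved (the nodes ArrayList from ExtractContent is assumed, as in the question):
Document dstDoc = new Document();
dstDoc.FirstSection.Body.RemoveAllChildren();
foreach (Node node in nodes)
{
    // ImportNode clones the node into the destination document's context;
    // KeepSourceFormatting expands the source styles so fonts are not remapped.
    Node imported = dstDoc.ImportNode(node, true, ImportFormatMode.KeepSourceFormatting);
    dstDoc.FirstSection.Body.AppendChild(imported);
}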

Inserting Custom Element with AngleSharp

I'm trying to update a site that uses a sanitizer based on AngleSharp to process user-generated HTML content. The site users need to be able to embed iframes, and I am trying to use a whitelist to control which domains the frames can load from. I'd like to rewrite the 'blocked' iframes to a new custom element "blocked-iframe", which will then be stripped out by the sanitizer, so we can review whether other domains need to be added to the whitelist.
I'm trying to use a solution based on this answer: https://stackoverflow.com/a/55276825/794
It looks like so:
string BlockIFrames(string content)
{
    var parser = new HtmlParser(new HtmlParserOptions { });
    var doc = parser.Parse(content);
    foreach (var element in doc.QuerySelectorAll("iframe"))
    {
        var src = element.GetAttribute("src");
        if (string.IsNullOrEmpty(src) || !Settings.Sanitization.IFrameWhitelist.Any(wls => src.StartsWith(wls)))
        {
            var newElement = doc.CreateElement("blocked-iframe");
            foreach (var attr in element.Attributes)
            {
                newElement.SetAttribute(attr.Name, attr.Value);
            }
            element.Insert(AdjacentPosition.BeforeBegin, newElement.OuterHtml);
            element.Remove();
        }
    }
    return doc.FirstElementChild.OuterHtml;
}
It ostensibly works, but I notice that the angle brackets in the new element's tag are escaped on insertion, so the result just gets written into the page as text. I could build a map of replacements and run them against the string before sending it back, but I'm wondering if there's a way to do it using AngleSharp's API. The site is on 0.9.9 currently, and I'm not sure how far ahead we'll be able to update given some of the other dependencies in play.
Digging around in the source, I found the ReplaceChild method on INode, which works when called from the parent of element:
string BlockIFrames(string content)
{
    var parser = new HtmlParser(new HtmlParserOptions { });
    var doc = parser.Parse(content);
    foreach (var element in doc.QuerySelectorAll("iframe"))
    {
        var src = element.GetAttribute("src");
        if (string.IsNullOrEmpty(src) ||
            !Settings.Sanitization.IFrameWhitelist.Any(wls => src.StartsWith(wls)))
        {
            var newElement = doc.CreateElement("blocked-iframe");
            foreach (var attr in element.Attributes)
            {
                newElement.SetAttribute(attr.Name, attr.Value);
            }
            element.Parent.ReplaceChild(newElement, element);
        }
    }
    return doc.FirstElementChild.OuterHtml;
}
I will keep testing, but this seems decent enough to me; if there is a better way I'd love to hear it.
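For what it's worth, IElement also implements the DOM ChildNode mixin, so the same swap can be written without going through the parent; a sketch, assuming your AngleSharp version exposes the Replace method (I have not verified it against 0.9.9):
// Sketch: replace the iframe in place via the ChildNode mixin.
element.Replace(newElement);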

Aspose PDF - get text from page that has a matching string

I'm working with an existing library whose goal is to pull text out of PDFs to verify against expected values, quality-checking recorded data against the data in the PDF.
I'm looking for a way to succinctly pull a specific page's worth of text, given a string that should only fall on that page.
var pdfDocument = new Document(file.PdfFilePath);
var textAbsorber = new TextAbsorber
{
    ExtractionOptions =
    {
        FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
    }
};
pdfDocument.Pages.Accept(textAbsorber);
foreach (var page in pdfDocument.Pages)
{
}
I'm stuck inside the foreach(var page in pdfDocument.Pages) portion... or is that the right area to be looking?
Answer: the TextAbsorber must be recreated for each page, inside the foreach loop.
If the absorber isn't recreated, it keeps accumulating text from previous iterations.
public List<string> ProcessPage(MyInfoClass file, string find)
{
    var pdfDocument = new Document(file.PdfFilePath);
    foreach (Page page in pdfDocument.Pages)
    {
        var textAbsorber = new TextAbsorber
        {
            ExtractionOptions =
            {
                FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
            }
        };
        page.Accept(textAbsorber);
        var ext = textAbsorber.Text;
        var exts = ext.Replace("\n", "").Split('\r').ToList();
        if (ext.Contains(find))
            return exts;
    }
    return null;
}
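Usage is then a single call per document; the search string here is just a hypothetical example, and null comes back when no page matches:
// Hypothetical usage: get the lines of whichever page contains the marker text.
List<string> pageLines = ProcessPage(file, "Total Amount Due");
if (pageLines != null)
{
    // verify expected values against pageLines...
}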

Null value exception for a list in a WinForm

I have 2 lists:
public List<string> my_link = new List<string>();
public List<string> english_word = new List<string>();
I am scraping some links from a page and saving them into my_link; for this I am using code like:
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://search.freefind.com/find.html?id=59478474&pid=r&ics=1&query=" + x);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//font[@class='search-results']//a");
try
{
    foreach (HtmlNode n in nodes)
    {
        link = n.InnerHtml;
        link = link.Trim();
        my_link.Add(link);
    }
}
catch (NullReferenceException)
{
    MessageBox.Show("NO link found ");
}
Then I scrape some content by visiting each of the links I collected, storing the content of each link with english_word.Add(q). It can scrape content from all links except the last one. My code is like this:
foreach (string ss in my_link)
{
    HtmlWeb web2 = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc2 = web2.Load(ss);
    HtmlNodeCollection nodes2 = doc2.DocumentNode.SelectNodes("//table[@id='table1']//tr[position()>1]//td[position()=2]");
    try
    {
        foreach (HtmlNode nn in nodes2)
        {
            q = nn.InnerText;
            q = System.Net.WebUtility.HtmlDecode(q);
            q = q.Trim();
            english_word.Add(q);
        }
    }
    catch (NullReferenceException)
    {
        MessageBox.Show("No english word is found ");
    }
}
For the last link only, it shows "No english word is found".
What am I doing wrong?
First, catching a NullReferenceException here is not a very good idea; it's better to check for null where you're expecting nulls.
Second, you most probably get this exception because the HtmlNode.SelectNodes method returns null (not an empty collection of nodes, as you expected) when no nodes are found. See HTML Agility Pack Null Reference, C#/Html Agility Pack error "Value cannot be null. Parameter name: Source.", and a discussion on CodePlex.
So, instead of a try .. catch block you could use something like:
if (nodes2 != null)
{
    foreach (HtmlNode nn in nodes2)
    {
        q = nn.InnerText;
        q = System.Net.WebUtility.HtmlDecode(q);
        q = q.Trim();
        english_word.Add(q);
    }
}
else
{
    MessageBox.Show("No english word is found ");
}
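The same null guard can also be folded into the loop with the null-coalescing operator, at the cost of the explicit else branch (a sketch; requires using System.Linq):
// Sketch: treat a null SelectNodes result as an empty sequence.
foreach (HtmlNode nn in nodes2 ?? Enumerable.Empty<HtmlNode>())
{
    english_word.Add(System.Net.WebUtility.HtmlDecode(nn.InnerText).Trim());
}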
Change your catch statement to catch all exceptions, not just NullReferenceException.
The debugger is your friend; use it. I'm guessing that you get the exception somewhere before adding a new word to the list. Set a breakpoint in your foreach loop.

Select elements added to the DOM by a script

I've been trying to get either an <object> or an <embed> tag using:
HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");
This doesn't seem to work.
Can anyone please tell me how to get these tags and their InnerHtml?
A YouTube embedded video looks like this:
<embed height="385" width="640" type="application/x-shockwave-flash"
src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..."
allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">
I got a feeling the JavaScript might stop the swf player from working, hope not...
Cheers
Update 2010-08-26 (in response to OP's comment):
I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:
string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";
Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.
Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.
In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).
I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.
YouTubeScraper
OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.
using System.Net;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class YouTubeScraper
{
    public HtmlNode FindObjectElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);
        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];
            string javascript = scriptNode.InnerHtml;
            int objectNodeLocation = javascript.IndexOf("<object");
            if (objectNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(objectNodeLocation);
                int objectNodeEndLocation = htmlStart.IndexOf(">\" :");
                if (objectNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);
                    string unescaped = Unescape(finalEscapedHtml);
                    var objectDoc = new HtmlDocument();
                    objectDoc.LoadHtml(unescaped);
                    HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");
                    return objectNode;
                }
            }
        }
        return null;
    }

    public HtmlNode FindEmbedElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);
        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];
            string javascript = scriptNode.InnerHtml;
            int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");
            if (approxEmbedNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);
                int embedNodeEndLocation = htmlStart.IndexOf(">\";");
                if (embedNodeEndLocation != -1)
                {
                    string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);
                    string unescaped = Unescape(finalEscapedHtml);
                    var embedDoc = new HtmlDocument();
                    embedDoc.LoadHtml(unescaped);
                    HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");
                    return videoEmbedNode;
                }
            }
        }
        return null;
    }

    protected HtmlNodeCollection FindScriptNodes(string url)
    {
        var doc = new HtmlDocument();
        WebRequest request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            doc.Load(stream);
        }
        HtmlNode root = doc.DocumentNode;
        HtmlNodeCollection scriptNodes = root.SelectNodes("//script");
        return scriptNodes;
    }

    static string Unescape(string htmlFromJavascript)
    {
        // The JavaScript has escaped all of its HTML using backslashes. We need
        // to reverse this.
        // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
        // of this code. If you could improve it, please, I beg of you to do so. Personally,
        // I tested it on a grand total of three inputs. It worked for those, at least.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
    }

    static string UnescapeFromBeginning(Match match)
    {
        string text = match.ToString();
        if (text.StartsWith("\\"))
        {
            return text.Substring(1);
        }
        return text;
    }
}
And in case you're interested, here's a little demo I threw together (super fancy, I know):
class Program
{
    static void Main(string[] args)
    {
        var scraper = new YouTubeScraper();
        HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
        Console.WriteLine("David After Dentist:");
        Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
        Console.WriteLine();
        HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
        Console.WriteLine("Drunk History:");
        Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
        Console.WriteLine();
        HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
        Console.WriteLine("Jessica's Daily Affirmation:");
        Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
        Console.WriteLine();
        HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
        Console.WriteLine("Jazzercise - Move your Boogie Body:");
        Console.WriteLine(jazzerciseObjectNode.OuterHtml);
        Console.WriteLine();
        Console.Write("Finished! Hit Enter to quit.");
        Console.ReadLine();
    }
}
Original Answer
Why not try using the element's Id instead?
HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");
Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.
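If you go that route, here is a bare sketch of the idea (the backslash unescaping from the code above still applies before the fragment parses cleanly, and the XPath filter is just an assumption about which script holds the player markup):
// Sketch: pull the <script> body out as a string, unescape it, then parse it as HTML.
HtmlNode script = doc.DocumentNode.SelectSingleNode("//script[contains(., 'movie_player')]");
string unescaped = Regex.Replace(script.InnerText, @"\\(.)", m => m.Groups[1].Value);
var innerDoc = new HtmlDocument();
innerDoc.LoadHtml(unescaped);
HtmlNode player = innerDoc.GetElementbyId("movie_player");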
