extracting just page text using HTMLAgilityPack

OK, so I am really new to the XPath queries used in HTMLAgilityPack.
Let's consider this page: http://health.yahoo.net/articles/healthcare/what-your-favorite-flavor-says-about-you. What I want is to extract just the page content and nothing else.
To do that, I first remove the script and style tags.
Document = new HtmlDocument();
Document.LoadHtml(page);
TempString = new StringBuilder();
foreach (HtmlNode style in Document.DocumentNode.Descendants("style").ToArray())
{
    style.Remove();
}
foreach (HtmlNode script in Document.DocumentNode.Descendants("script").ToArray())
{
    script.Remove();
}
After that I am trying to use //text() to get all the text nodes.
foreach (HtmlTextNode node in Document.DocumentNode.SelectNodes("//text()"))
{
    TempString.AppendLine(node.InnerText);
}
However, I am not getting just the text; I am also getting numerous \r and \n characters.
I would appreciate a little guidance on this.

If you consider that script and style nodes only have text nodes for children, you can use this XPath expression to get text nodes that are not in script or style tags, so that you don't need to remove the nodes beforehand:
//*[not(self::script or self::style)]/text()
You can further exclude text nodes that are only whitespace using XPath's normalize-space():
//*[not(self::script or self::style)]/text()[not(normalize-space(.)="")]
or the shorter
//*[not(self::script or self::style)]/text()[normalize-space()]
But you will still get text nodes that may have leading or trailing whitespace. This can be handled in your application as @aL3891 suggests.

If the \r and \n characters in the final string are the problem, you could just remove them after the fact:
TempString.ToString().Replace("\r", "").Replace("\n", "");
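Removing every \r and \n outright also glues together words that were separated only by a line break. A gentler alternative, sketched below, collapses each run of whitespace into a single space and trims the ends. The names TextCleanup and CollapseWhitespace are made up for this illustration; they are not part of HtmlAgilityPack.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleanup
{
    // Collapse any run of whitespace (spaces, tabs, \r, \n) into one space
    // and trim the ends of the string.
    public static string CollapseWhitespace(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string raw = "  What Your Favorite\r\n\r\n   Flavor Says\tAbout You \n";
        Console.WriteLine(CollapseWhitespace(raw)); // What Your Favorite Flavor Says About You
    }
}
```

This could be applied to TempString.ToString() once at the end, or to each node.InnerText before appending it.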

Related

Breakdown of HTML RTF string for 3rd Party Formatting

I have decided to come here with my problem as my head is fried and I have a deadline. My basic scenario is that on our system we save RTF HTML in the database, for example:
This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text
Which renders as following:
This is Line 1 with more Bold and italic text
These HTML strings are exported to PDF and up until now the PDF renderer used could read and render this HTML correctly... Not any more. I am therefore having to do this the manual way and read each tag individually and apply the styling on the fly as I construct each paragraph. Fine.
My idea is to build a list of strings, for example:
"This is "
"<strong>Line 1</strong>"
" with more "
"<strong>Bold and <em>italic</em></strong>"
" text"
Each row either has an un-formatted string or contains all style tags for a given string.
I should then be able to build up my paragraph one string at a time, checking for tags and applying them when required.
I am however mentally failing at the first hurdle (Friday afternoon syndrome??) and cannot figure out how to build my list. I'm guessing I am going to use RegEx.
If someone could advise on how I might get a list like this, it would be greatly appreciated.
Edit
Following a Python example suggested below I have implemented the following, but this only gives me the elements surrounded by tags and none of the unformatted text:
var stringElements = Regex.Matches(paragraphString, @"(<(.*?)>.*?</\2>)", RegexOptions.Compiled)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();
So close...

I apologize up front, since my answer is written in Python, however I hope this provides you with some guidance.
import re
s = 'This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text'
matches = [i[0] for i in re.findall(r'(<(.*?)>.*?</\2>)', s)]
for i in matches:
    s = s.replace(i, '\n' + i + '\n')
print(s)
Gives:
This is
<strong> Line 1</strong>
with more
<strong>Bold and <em>italic</em></strong>
text

So I have found a solution by using the glorious Html Agility Pack:
var doc = new HtmlDocument();
doc.LoadHtml(paragraphString);
var htmlBody = doc.DocumentNode.SelectSingleNode(@"/p");
HtmlNodeCollection childNodes = htmlBody.ChildNodes;
List<string> elements = new List<string>();
foreach (var node in childNodes)
{
    elements.Add(node.OuterHtml);
}
As a note, I was previously removing the Paragraph tags surrounding the html from the paragraphString but have left them in for this example. So the string being passed in is actually:
<p>This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text</p>
I think the RegEx answer has some credibility and I am sure there is something in there that is just excluding the non 'noded' elements. This seems nicer though as you have access to the elements in a class-structure kinda way.
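For comparison, the regex from the question can be made to return the unformatted runs as well by adding a second alternative that matches text containing no '<'. This is only a sketch under narrow assumptions: tags carry no attributes, and an element never contains another element with the same name. The class name HtmlSegments is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class HtmlSegments
{
    // Either a complete <tag>...</tag> element (closed by the same tag name),
    // or a run of plain text that contains no '<'.
    private static readonly Regex SegmentPattern =
        new Regex(@"<(\w+)>.*?</\1>|[^<]+", RegexOptions.Singleline);

    public static List<string> Split(string html)
    {
        return SegmentPattern.Matches(html)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    public static void Main()
    {
        string input = "This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text";
        foreach (string part in Split(input))
        {
            Console.WriteLine("[" + part + "]");
        }
        // [This is]
        // [<strong> Line 1</strong>]
        // [ with more ]
        // [<strong>Bold and <em>italic</em></strong>]
        // [ text]
    }
}
```

Anything outside those assumptions (attributes, self-closing tags, same-name nesting) is a reason to stay with the Html Agility Pack approach above.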

Find HTML / XML node using RegEx

I am parsing a number of HTML documents, and within each need to try and extract a UK postal address. In order to do so I am parsing the HTML with AngleSharp and then looking for nodes with TextContent that match my RegEx:
var parser = new HtmlParser();
var source = "<html><head><title>Test Title</title></head><body><h1>Some example source</h1><p>This is a paragraph element and example postode EC1A 4NP</body></html>";
var document = parser.Parse(source);
Regex searchTerm = new Regex("([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");
var list = document.All.Where(m => searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));
This returns 3 results, the html, body and p elements. The only element I want to return is the p element as that has the innerText matching the regex correctly. There may also be more than one match on a page so I can't just return the last result. I am looking to just return any elements where the text in that element (not in any child nodes) matches the regex.
Edit
I don't know the document structure in advance, or even which tag the postcode will be in, which is why I'm using regex. Once I have the result I plan to traverse the DOM to obtain the rest of the address, so I don't want to treat the document as just a string.

If you are looking to extract a particular node within a well-formed HTML/XML document, then have a look at using XPath. There are some examples on MSDN.
You can use utility libraries such as HTML Tidy to clean up the HTML and make it well-formed if it isn't already.

OK, I took a different approach in the end. I searched the HTML document as a string with the regex, NOT to parse the HTML but simply to find the exact matching value. Once I had that value, it was simple enough to use an XPath expression to return the node. In the example above, the regex search returns EC1A 4NP and the following XPath:
//*[contains(text(),'EC1A 4NP')]
returns the required node. For ease of XPath use, I switched from AngleSharp to HtmlAgilityPack for the HTML parsing.
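One caveat with contains(text(), ...): in XPath 1.0 a node-set passed to a string function is converted using only its first node, so a postcode that sits in a second text node of an element (for example after a <br>) is not found. Testing each text node with text()[contains(., ...)] avoids that. The sketch below builds one such query per postcode found in the raw HTML. It uses only the regex from the question; the class and method names are hypothetical, and unlike the question it does not upper-case the input, so lower-case postcodes are not matched.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PostcodeXPath
{
    // UK postcode pattern taken from the question.
    private static readonly Regex Postcode = new Regex(
        "([A-PR-UWYZ][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)");

    // Returns one XPath expression per distinct postcode found in the HTML string.
    // The matched values contain only letters, digits and spaces, so they are
    // safe to place inside a single-quoted XPath literal.
    public static List<string> BuildQueries(string html)
    {
        return Postcode.Matches(html)
            .Cast<Match>()
            .Select(m => m.Value)
            .Distinct()
            .Select(value => "//*[text()[contains(., '" + value + "')]]")
            .ToList();
    }

    public static void Main()
    {
        string source = "<html><head><title>Test Title</title></head><body><h1>Some example source</h1><p>This is a paragraph element and example postode EC1A 4NP</body></html>";
        foreach (string query in BuildQueries(source))
        {
            Console.WriteLine(query); // //*[text()[contains(., 'EC1A 4NP')]]
        }
    }
}
```

Each resulting string can then be passed to HtmlAgilityPack's SelectNodes on the parsed document.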

I've had a quick look at the parser's documentation. Below is what you need to do if you want to check only the text in <p> tags.
var list = document.All.Where(m => m.LocalName.ToUpper() == "P" && searchTerm.IsMatch((m.TextContent ?? "").ToUpper()));

HTML Strip Function

Here is a tough nut to crack.
I have some HTML which needs to be stripped of some tags, attributes AND properties.
Basically there are three different approaches which are to be considered:
String Operations: Iterate through the HTML string and strip it via string operations 'manually'
Regex: Parsing HTML with RegEx is evil. Is stripping HTML evil too?
Using a library to strip it (e.g. HTML Agility Pack)
My wish is that I have lists for:
acceptedTags (e.g. SPAN, DIV, OL, LI)
acceptedAttributes (e.g. STYLE, SRC)
acceptedProperties (e.g. TEXT-ALIGN, FONT-WEIGHT, COLOR, BACKGROUND-COLOR)
Which I can pass to this function which strips the HTML.
Example Input:
<BODY STYLE="font-family:Tahoma;font-size:11;"> <DIV STYLE="margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;font-family:tahoma;font-size:11;">Hello</SPAN></BODY>
Example Output (with parameter lists from above):
<DIV STYLE="text-align:Left;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;">Hello</SPAN>
the entire tag Body is stripped (not accepted tag)
properties margin, font-family and font-size are stripped from DIV-Tag
properties font-family and font-size are stripped from SPAN-Tag.
What have I tried?
Regex seemed to be the best approach at first glance, but I couldn't get it working properly.
Articles on Stackoverflow I had a look at:
Regular expression to remove HTML tags
How to clean HTML tags using C#
...and many more.
I tried the following regex:
Dim AcceptableTags As String = "font|span|html|i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
Dim Html as String = Regex.Replace(b.HTML, WhiteListPattern, "", RegexOptions.Compiled)
However, this only removes tags, not attributes or properties!
I'm definitely not looking for someone to do the whole job, but rather for someone who can point me in the right direction.
I'm happy with either C# or VB.NET as answers.

Definitely use a library! (See this)
With HTMLAgilityPack you can do pretty much everything you want:
Remove tags you don't want:
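string[] allowedTags = {"SPAN", "DIV", "OL", "LI"};
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//node()"))
{
    // Only test elements: "//node()" also returns text nodes (named "#text"),
    // which would otherwise be removed along with the unwanted tags.
    if (node.NodeType == HtmlNodeType.Element && !allowedTags.Contains(node.Name.ToUpper()))
    {
        HtmlNode parent = node.ParentNode;
        // true keeps the removed node's children in the document
        parent.RemoveChild(node, true);
    }
}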
Remove attributes you don't want & remove properties
string[] allowedAttributes = { "STYLE", "SRC" };
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//node()"))
{
    List<HtmlAttribute> attributesToRemove = new List<HtmlAttribute>();
    foreach (HtmlAttribute att in node.Attributes)
    {
        if (!allowedAttributes.Contains(att.Name.ToUpper())) attributesToRemove.Add(att);
        else
        {
            string newAttrib = string.Empty;
            //do string manipulation based on your checking accepted properties
            //one way would be to split the attribute.Value by a semicolon and do a
            //String.Contains() on each one, not appending those that don't match. Maybe
            //use a StringBuilder instead too
            att.Value = newAttrib;
        }
    }
    foreach (HtmlAttribute attribute in attributesToRemove)
    {
        node.Attributes.Remove(attribute);
    }
}
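The string manipulation left as a comment in the else branch could look roughly like the sketch below: split the style value on semicolons and keep only the declarations whose property name is in the accepted list. StyleFilter and Filter are hypothetical names, the accepted names are assumed to be upper-case as in the question, and splitting on ';' is a simplification that breaks on values which themselves contain a semicolon.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StyleFilter
{
    // Keep only the "name:value" declarations whose property name
    // (compared in upper case) appears in acceptedProperties.
    public static string Filter(string style, ICollection<string> acceptedProperties)
    {
        var kept = style
            .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(declaration => declaration.Trim())
            .Where(declaration =>
            {
                int colon = declaration.IndexOf(':');
                if (colon <= 0) return false; // no property name: drop it
                string name = declaration.Substring(0, colon).Trim().ToUpperInvariant();
                return acceptedProperties.Contains(name);
            });
        return string.Concat(kept.Select(declaration => declaration + ";"));
    }

    public static void Main()
    {
        string[] accepted = { "TEXT-ALIGN", "FONT-WEIGHT", "COLOR", "BACKGROUND-COLOR" };
        string divStyle = "margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;";
        Console.WriteLine(Filter(divStyle, accepted)); // text-align:Left;
    }
}
```

In the loop above this would presumably be applied only when the attribute is STYLE, for example att.Value = StyleFilter.Filter(att.Value, acceptedProperties), leaving other accepted attributes such as SRC untouched.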

I would probably actually just write this myself as a multi-step process:
1) Exclude all rules for removing properties from tags that are listed as tags to be removed (the tags won't be there anyway!)
2) Walk the document, taking a copy of the document without the excluded tags (i.e. in your example, copy everything up until "<div", then wait until I see ">" before continuing to copy). If I'm in copy mode and I see "ExcludedTag=", then stop copying until I see the closing quotation mark.
You'll probably want to do some pre-work validation on the html and getting the formatting the same, etc. before running this process to avoid broken output.
Oh, and copy in chunks, i.e. just keep the index of copy start until you reach copy end, then copy the whole chunk, not individual characters!
Hopefully that helps as a starting point.

Remove html special Char from displayed text

I have XML which I convert to plain text and then display with HTML formatting in a web browser.
At the end of each line the symbol ¶ appears. I would like to remove the symbol or replace it with a '.'.
Does anyone know how I could do this?
This is how i convert XML to plain text:
XmlDocument doc = new XmlDocument();
doc.LoadXml(this.dataGridViewResult.SelectedRows[0].Cells["XMLEvent"].Value.ToString());
StringBuilder sb = new StringBuilder();
foreach (XmlNode node in doc.DocumentElement.ChildNodes)
{
    sb.Append(char.ToUpper(node.Name[0]));
    sb.Append(node.Name.Substring(1));
    sb.Append(' ');
    sb.AppendLine(node.InnerText);
}

Where does the '¶' appear? Is it when you open the converted text file in an editor?
Normally that sign is used to visualize the end of a line in a text editor, and it's not really part of your text. In many cases the text editor has an option to show or hide line-ending markers.
However, if the output you are interested in is HTML, the character should not appear here.

Try this:
sb.AppendLine(node.InnerText.TrimEnd('¶'));
or
sb.AppendLine(node.InnerText.Replace("¶","."));

After the foreach loop, try:
sb.Replace("¶", ".");

Specifically in your case (assuming it's always at the end of the line), I'd use:
sb.AppendLine(node.InnerText.Replace('\u00b6', '.'));
if you want to keep your source code free of non-ASCII characters.

How to convert InnerText to InnerHtml in Webbrowser Control in C#?

I'm working on a WYSIWYG editor with a built-in spell checker (Hunspell) and online highlighting of misspelled words. I'm using the WebBrowser control as an HTML handler. It's far easier to spell check text than HTML in the WebBrowser control, but going this way I lose all HTML formatting.
So the question is: is there any way to spell check body innertext and then convert it to body innerhtml with previous formatting? (with no use of HtmlAgilityPack or Majestic12 or SgmlReader or ZetaHtmlTidy).
Thanks in advance.

As opposed to checking the spelling of the innerText property of a given element, a better approach might be to loop through the child elements and check the spelling of each child's innerText instead.
This approach, while possibly limiting context-based spell-checking, should keep the markup intact.
Note: You might want to take into consideration that each child node may also contain further children.

I chose to check the spelling of the innerText property, but when replacing any changed words, I replaced them within the innerHTML. This was rather easy when changing all instances of a misspelled word. Simply use a Regular Expression to gather the indices of all matching words in the innerHTML and replace each one.
Regex wordEx = new Regex(@"[A-Za-z]+", RegexOptions.Compiled);
MatchCollection mcol = wordEx.Matches(webEditor.Document.Body.InnerHtml);
foreach (Match m in mcol)
{
    //Basic checking for whether this word is an HTML tag. This is not perfect.
    //The index check avoids a negative Substring start when the match is at position 0.
    if (m.Value == e.Word && (m.Index == 0 || webEditor.Document.Body.InnerHtml.Substring(m.Index - 1, 1) != "<"))
    {
        wordIndeces.Add(m.Index);
    }
}
foreach (int curWordTextIndex in wordIndeces)
{
    Word word = Word.GetWordFromPosition(webEditor.Document.Body.InnerHtml, curWordTextIndex);
    string tmpText = webEditor.Document.Body.InnerHtml.Remove(word.Start, word.Length);
    webEditor.Document.Body.InnerHtml = tmpText.Insert(word.Start, e.NewWord);
}
UpdateSpellingForm(e.TextIndex);
When replacing a single instance, I just looped through the InnerText to find which instance needs to be replaced. Then I looped through the InnerHTML until I found the correct instance and replaced it.
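As an alternative way to do the "replace all instances" step, a single regex replacement can skip matches that sit inside a tag by checking that no '>' follows before the next '<'. This is a sketch, not the code used above: HtmlWordReplacer and ReplaceWord are hypothetical names, and it assumes the markup never contains an unescaped '>' in text content. Entity names such as the "nbsp" in &nbsp; are still treated as words.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlWordReplacer
{
    public static string ReplaceWord(string html, string word, string replacement)
    {
        // \b ... \b   : match the word only as a whole word
        // (?![^<]*>)  : reject the match if a '>' follows before the next '<',
        //               which means the word is inside a tag (name or attribute)
        string pattern = @"\b" + Regex.Escape(word) + @"\b(?![^<]*>)";
        // A MatchEvaluator is used so that '$' in the replacement is taken literally.
        return Regex.Replace(html, pattern, m => replacement);
    }

    public static void Main()
    {
        string html = "<p class=\"teh\">teh cat and <b>teh</b> dog</p>";
        Console.WriteLine(ReplaceWord(html, "teh", "the"));
        // <p class="teh">the cat and <b>the</b> dog</p>
    }
}
```

Because the whole string is rewritten in one pass, there are no stored indices to go stale when the new word has a different length from the old one.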
