There is a tough nut to crack.
I have an HTML string which needs to be stripped of some tags, attributes AND CSS properties.
Basically there are three different approaches which are to be considered:
String Operations: Iterate through the HTML string and strip it via string operations 'manually'
Regex: Parsing HTML with RegEx is evil. Is stripping HTML evil too?
Using a library to strip it (e.g. HTML Agility Pack)
My wish is that I have lists for:
acceptedTags (e.g. SPAN, DIV, OL, LI)
acceptedAttributes (e.g. STYLE, SRC)
acceptedProperties (e.g. TEXT-ALIGN, FONT-WEIGHT, COLOR, BACKGROUND-COLOR)
Which I can pass to this function which strips the HTML.
Example Input:
<BODY STYLE="font-family:Tahoma;font-size:11;"> <DIV STYLE="margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;font-family:tahoma;font-size:11;">Hello</SPAN></BODY>
Example Output (with parameter lists from above):
<DIV STYLE="text-align:Left;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;">Hello</SPAN>
the entire tag Body is stripped (not accepted tag)
properties margin, font-family and font-size are stripped from DIV-Tag
properties font-family and font-size are stripped from SPAN-Tag.
What have I tried?
Regex seemed to be the best approach at first glance, but I couldn't get it working properly.
Articles on Stackoverflow I had a look at:
Regular expression to remove HTML tags
How to clean HTML tags using C#
...and many more.
I tried the following regex:
Dim AcceptableTags As String = "font|span|html|i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
Dim Html as String = Regex.Replace(b.HTML, WhiteListPattern, "", RegexOptions.Compiled)
However, this is only removing tags and no attributes or properties!
I'm definitely not looking for someone to do the whole job, but rather for someone who can point me in the right direction.
I'm happy with either C# or VB.NET as answers.
Definitely use a library! (See this)
With HTMLAgilityPack you can do pretty much everything you want:
Remove tags you don't want:
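// Assumes "using System.Linq;" and "using HtmlAgilityPack;", and that doc is a loaded HtmlDocument.
string[] allowedTags = { "SPAN", "DIV", "OL", "LI" };
// "//*" selects elements only; "//node()" would also select text nodes and strip the text itself.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*");
if (nodes != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode node in nodes)
    {
        if (!allowedTags.Contains(node.Name.ToUpper()))
        {
            // Passing true keeps the removed tag's children in place.
            node.ParentNode.RemoveChild(node, true);
        }
    }
}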
Remove attributes you don't want & remove properties
// Assumes "using System.Linq;" and "using HtmlAgilityPack;", and that doc is a loaded HtmlDocument.
string[] allowedAttributes = { "STYLE", "SRC" };
string[] allowedProperties = { "TEXT-ALIGN", "FONT-WEIGHT", "COLOR", "BACKGROUND-COLOR" };
HtmlNodeCollection elements = doc.DocumentNode.SelectNodes("//*");
if (elements != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode node in elements)
    {
        // Take a snapshot so attributes can be removed while looping.
        foreach (HtmlAttribute att in node.Attributes.ToList())
        {
            if (!allowedAttributes.Contains(att.Name.ToUpper()))
            {
                node.Attributes.Remove(att);
            }
            else if (att.Name.ToUpper() == "STYLE")
            {
                // Split the style value on semicolons and keep only the
                // declarations whose property name is in the accepted list.
                var kept = att.Value.Split(';')
                    .Where(d => d.Contains(":") &&
                                allowedProperties.Contains(d.Split(':')[0].Trim().ToUpper()));
                att.Value = string.Concat(kept.Select(d => d.Trim() + ";"));
            }
        }
    }
}
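For comparison, here is the same whitelist idea as a minimal Python sketch that uses only the standard library's html.parser instead of HTML Agility Pack. The names WhitelistSanitizer and sanitize are made up for this illustration, and upper-casing the output tags is an assumption chosen to match the example in the question. It does not try to handle every edge case (for example, an allowed self-closing tag is written as an open/close pair).

```python
from html import escape
from html.parser import HTMLParser


class WhitelistSanitizer(HTMLParser):
    """Rebuilds HTML, keeping only whitelisted tags, attributes and style properties."""

    def __init__(self, tags, attributes, properties):
        # convert_charrefs=False so entities in text are passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.tags = {t.lower() for t in tags}
        self.attributes = {a.lower() for a in attributes}
        self.properties = {p.lower() for p in properties}
        self.out = []

    def _filter_style(self, value):
        # Keep only "name:value" declarations whose name is whitelisted.
        kept = []
        for declaration in value.split(";"):
            name, sep, _ = declaration.partition(":")
            if sep and name.strip().lower() in self.properties:
                kept.append(declaration.strip() + ";")
        return "".join(kept)

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            return  # drop the tag itself; its children are still visited
        parts = [tag.upper()]
        for name, value in attrs:
            if name not in self.attributes:
                continue
            value = value or ""
            if name == "style":
                value = self._filter_style(value)
            parts.append('%s="%s"' % (name.upper(), escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag in self.tags:
            self.out.append("</%s>" % tag.upper())

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def sanitize(html, tags, attributes, properties):
    parser = WhitelistSanitizer(tags, attributes, properties)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Called with the three lists from the question, sanitize(html, ["SPAN", "DIV", "OL", "LI"], ["STYLE", "SRC"], ["TEXT-ALIGN", "FONT-WEIGHT", "COLOR", "BACKGROUND-COLOR"]) walks the input once and never needs a regex to find tag boundaries.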
I would probably actually just write this myself as a multi-step process:
1) Exclude all rules for removing properties from tags that are listed as tags to be removed (the tags won't be there anyway!)
2) Walk the document, taking a copy of the document without excluded tags (i.e. in your example, copy everything up until "< div", then wait until I see ">" before continuing to copy). If I'm in copy mode and I see "ExcludedTag=", then stop copying until I see the closing quotation mark.
You'll probably want to do some pre-work validation on the html and getting the formatting the same, etc. before running this process to avoid broken output.
Oh, and copy in chunks, i.e. just keep the index of copy start until you reach copy end, then copy the whole chunk, not individual characters!
Hopefully that helps as a starting point.
Related
I have decided to come here with my problem as my head is fried and I have a deadline. My basic scenario is that on our system we save RTF HTML in the database, for example:
This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text
Which renders as following:
This is Line 1 with more Bold and italic text
These HTML strings are exported to PDF and up until now the PDF renderer used could read and render this HTML correctly... Not any more. I am therefore having to do this the manual way and read each tag individually and apply the styling on the fly as I construct each paragraph. Fine.
My idea is to build a list of strings, for example:
"This is "
"<strong>Line 1</strong>"
" with more "
"<strong>Bold and <em>italic</em></strong>"
" text"
Each row either has an un-formatted string or contains all style tags for a given string.
I should then be able to build up my paragraph one string at a time, checking for tags and applying them when required.
I am however mentally failing at the first hurdle (Friday afternoon syndrome??) and cannot figure out how to build my list. I'm guessing I am going to use RegEx.
If someone is able to advise on how I might get a list like this, it would be greatly appreciated.
Edit
Following a Python example suggested below I have implemented the following, but this only gives me the elements surrounded by tags and none of the unformatted text:
var stringElements = Regex.Matches(paragraphString, @"(<(.*?)>.*?</\2>)", RegexOptions.Compiled)
.Cast<Match>()
.Select(m => m.Value)
.ToList();
So close...
I apologize up front, since my answer is written in Python, however I hope this provides you with some guidance.
import re
s = 'This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text'
matches = [i[0] for i in re.findall(r'(<(.*?)>.*?</\2>)', s)]
for i in matches:
    s = s.replace(i, '\n' + i + '\n')
print(s)
Gives:
This is
<strong> Line 1</strong>
with more
<strong>Bold and <em>italic</em></strong>
text
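If you want the pieces as a list (including the unformatted text between the tagged parts) rather than a printed string, one option is to walk the matches and copy the gaps between them as well. This is only a sketch: split_segments is a name invented here, and the pattern assumes that a tag is never nested inside another tag of the same name.

```python
import re

# A top-level element: an opening tag, anything, then the matching closing tag.
ELEMENT = re.compile(r'<(\w+)[^>]*>.*?</\1>', re.DOTALL)


def split_segments(html):
    """Split html into plain-text runs and top-level tagged elements, in order."""
    segments, pos = [], 0
    for m in ELEMENT.finditer(html):
        if m.start() > pos:
            segments.append(html[pos:m.start()])  # text before this element
        segments.append(m.group(0))               # the element itself
        pos = m.end()
    if pos < len(html):
        segments.append(html[pos:])               # trailing text
    return segments
```

The same loop translates directly to C# with Regex.Matches, using Match.Index and Match.Length to find the gaps.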
So I have found a solution by using the glorious Html Agility Pack:
var doc = new HtmlDocument();
doc.LoadHtml(paragraphString);
var htmlBody = doc.DocumentNode.SelectSingleNode(@"/p");
HtmlNodeCollection childNodes = htmlBody.ChildNodes;
List<string> elements = new List<string>();
foreach (var node in childNodes)
{
elements.Add(node.OuterHtml);
}
As a note, I was previously removing the Paragraph tags surrounding the html from the paragraphString but have left them in for this example. So the string being passed in is actually:
<p>This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text</p>
I think the RegEx answer has some credibility and I am sure there is something in there that is just excluding the non 'noded' elements. This seems nicer though as you have access to the elements in a class-structure kinda way.
I am trying to retrieve iframe tags and attributes from an HTML input.
Sample input
<div class="1"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/1" frameborder="0" allowfullscreen=""></iframe></div>
<div class="2"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/2" frameborder="0" allowfullscreen=""></iframe></div>
I have been trying to collect them using the following regex:
<iframe.+?width=[\"'](?<width>.*?)[\"']?height=[\"'](?<height>.*?)[\"']?src=[\"'](?<src>.*?)[\"'].+?>
This results in
This is exactly the format I want.
The problem is, if the HTML attributes are in a different order this regex won't work.
Is there any way to modify this regex to ignore the attribute order and return the iframes grouped in Matches so that I could iterate through them?
Here is a regex that will ignore the order of attributes:
(?<=<iframe[^>]*?)(?:\s*width=["'](?<width>[^"']+)["']|\s*height=["'](?<height>[^'"]+)["']|\s*src=["'](?<src>[^'"]+)["'])+[^>]*?>
RegexStorm demo
C# sample code:
var rx = new Regex(@"(?<=<iframe[^>]*?)(?:\s*width=[""'](?<width>[^""']+)[""']|\s*height=[""'](?<height>[^'""]+)[""']|\s*src=[""'](?<src>[^'""]+)[""'])+[^>]*?>");
var input = @"YOUR INPUT STRING";
var matches = rx.Matches(input).Cast<Match>().ToList();
Regular expressions match patterns, and the structure of your string defines which pattern to use, thus, if you want to use regular expressions order is important.
You can deal with this in 2 ways:
The good and recommended way is to not parse HTML with regular expressions (mandatory link), but rather use a parsing framework such as the HTML Agility Pack. This should allow you to process the HTML you need and extract any values you are after.
The 2nd, bad, and non-recommended way to do this is to break your matching into 2 parts. You first use something like <iframe(.+?)></iframe> to extract the entire iframe declaration and then use multiple, smaller regular expressions to find the settings you are after. The above regex obviously fails if your iframe is structured like so: <iframe.../>. This should give you a hint as to why you should not do HTML parsing through regular expressions.
As stated, you should go with the first option.
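For illustration only, the two-part matching described above could look like the following Python sketch. The names IFRAME_TAG, ATTRIBUTE and iframe_attributes are invented here, and the pattern assumes that no attribute value contains a ">" character, which is exactly the kind of assumption a real parser does not need.

```python
import re

# Step 1: grab everything between "<iframe" and the next ">".
IFRAME_TAG = re.compile(r'<iframe\b([^>]*)>', re.IGNORECASE)
# Step 2: pull name="value" (or name='value') pairs out of that text.
ATTRIBUTE = re.compile(r'([\w-]+)\s*=\s*(["\'])(.*?)\2', re.DOTALL)


def iframe_attributes(html):
    """Return one dict of attribute name -> value per <iframe> start tag."""
    result = []
    for tag in IFRAME_TAG.finditer(html):
        attrs = {name.lower(): value
                 for name, _quote, value in ATTRIBUTE.findall(tag.group(1))}
        result.append(attrs)
    return result
```

Because each attribute is matched on its own, their order inside the tag no longer matters, and each iframe ends up as one entry you can iterate over.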
You can use this regex
<iframe[ ]+(([a-z]+) *= *['"]*([a-zA-Z0-9\/:\.%]*)['"]*[ ]*)*>
It matches each 'name'='value' pair repeatedly and stores it in the same order in the matches; you can iterate through the matches to get the names and values sequentially. It caters for most characters in the value, but you may add a few more if needed.
With Html Agility Pack (to be had via nuget):
using System;
using HtmlAgilityPack;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load("HTMLPage1.html"); // or doc.LoadHtml(/*contentstring*/);
            HtmlNodeCollection iframes = doc.DocumentNode.SelectNodes("//iframe");
            if (iframes == null) return; // SelectNodes returns null when nothing matches
            foreach (HtmlNode iframe in iframes)
            {
                // The second argument is the default returned when the attribute is missing.
                Console.WriteLine(iframe.GetAttributeValue("width", "null"));
                Console.WriteLine(iframe.GetAttributeValue("height", "null"));
                Console.WriteLine(iframe.GetAttributeValue("src", "null"));
            }
        }
    }
}
You need to use an OR operator (|). See the changes below:
<iframe.+?((width=[\"'](?<width>.*?)[\"']?)|(height=[\"'](?<height>.*?)[\"']?)|(src=[\"'](?<src>.*?)[\"']))*.+?>
OK, so I am really new to the XPath queries used in HtmlAgilityPack.
Let's consider this page: http://health.yahoo.net/articles/healthcare/what-your-favorite-flavor-says-about-you. What I want is to extract just the page content and nothing else.
For that, I first remove the script and style tags.
Document = new HtmlDocument();
Document.LoadHtml(page);
TempString = new StringBuilder();
foreach (HtmlNode style in Document.DocumentNode.Descendants("style").ToArray())
{
style.Remove();
}
foreach (HtmlNode script in Document.DocumentNode.Descendants("script").ToArray())
{
script.Remove();
}
After that I am trying to use //text() to get all the text nodes.
foreach (HtmlTextNode node in Document.DocumentNode.SelectNodes("//text()"))
{
TempString.AppendLine(node.InnerText);
}
However, not only am I not getting just the text, I am also getting numerous \r and \n characters.
Please, I need a little guidance in this regard.
If you consider that script and style nodes only have text nodes for children, you can use this XPath expression to get text nodes that are not in script or style tags, so that you don't need to remove the nodes beforehand:
//*[not(self::script or self::style)]/text()
You can further exclude text nodes that are only whitespace using XPath's normalize-space():
//*[not(self::script or self::style)]/text()[not(normalize-space(.)="")]
or the shorter
//*[not(self::script or self::style)]/text()[normalize-space()]
But you will still get text nodes that may have leading or trailing whitespace. This can be handled in your application as @aL3891 suggests.
If the \r and \n characters in the final string are the problem, you could just remove them after the fact:
string text = TempString.ToString().Replace("\r", "").Replace("\n", "");
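To show both ideas together (skipping script/style content and cleaning up whitespace per text node), here is a small Python sketch built on the standard library's html.parser rather than on HtmlAgilityPack. TextExtractor and visible_text are hypothetical names for this example. It collapses every run of whitespace inside a text node to a single space, which is a slightly stronger clean-up than just deleting \r and \n.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # split() with no arguments drops \r, \n, tabs and repeated spaces.
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.chunks.append(text)


def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Whitespace-only text nodes produce an empty string after the join and are dropped, which mirrors the normalize-space() filter in the XPath version.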
I have following string:
<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4
and want to know whether it contains text3 inside divs that go after prefix:
prefix<div>...text3...</div>
but I don't know how to make a regex for that, since I can't use [^<]+ because divs can contain a strong tag inside.
Please help
EDIT:
Div tags after prefix are guaranteed not to be nested
Language is C#
Text4 is very long, so regex must not look after closing div
EDIT2: I don't want to use html parser, it can be easily (and MUCH faster) achieved with Regex. HTML there is simple: no attributes in tags; no nesting div's. And even some % of wrong answers are acceptable in my case.
If you turn off the "greedy" option, you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)
Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.
Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.
EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
this is my new regex:
prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>
seems to work ok.
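As a quick sanity check of that pattern, here is a Python sketch (the lookahead syntax is the same as in .NET). The function name div_after_prefix_contains is made up, the two capturing groups are written as non-capturing ones, and a \s* has been added after the prefix as an assumption, because the sample string has a space between "prefix" and "<div>".

```python
import re


def div_after_prefix_contains(html, prefix, needle):
    """True if a non-nested <div> right after prefix contains needle."""
    # "[^<]*<(?!/div>)" consumes text up to any tag that is not </div>,
    # so the search can never run past the div's own closing tag.
    inner = r'(?:[^<]*<(?!/div>))*[^<]*'
    pattern = (re.escape(prefix) + r'\s*<div>' + inner
               + re.escape(needle) + inner + r'</div>')
    return re.search(pattern, html) is not None
```

Since the match has to end at the first </div> after the prefix, the long text4 tail is never scanned once a match is found.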
For C# + HtmlAgilityPack you can do something like:
InputString = Regex.Replace(InputString,"^(?:[^<]+?|<[^>]*>)*?prefix","");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(InputString);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");
The prefix removal is still not a good way of dealing with it. Ideally you'd do something like using HtmlAgilityPack to find where prefix occurs in the DOM, translate that to provide the position in the string, then do a substring(pos,len) (or equivalent) to look at only the relevant text (you can also avoid looking at text4 using similar method).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:
var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';
InputString = InputString.replace(/^.*?prefix/,'');
var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>')
console.log(MatchingDivs.get());
This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).
I need to write some code that will search and replace whole words in a string that are outside HTML tags. So if I have this string:
string content = "the brown fox jumped over <b>the</b> lazy dog over there";
string keyword = "the";
I need to do something like:
if (content.ToLower().Contains(keyword.ToLower()))
content = content.Replace(keyword, String.Format("<span style=\"background-color:yellow;\">{0}</span>", keyword));
but I don't want to replace the "the" in the bold tags or the "the" in "there", just the first "the".
You can use this library to parse your HTML and replace only the words that are not inside any HTML tag. To replace only the word "the" and not "there", use Regex.Replace("the\s+"...) instead of String.Replace.
Try this:
content = Regex.Replace(content, "(?<!>)"
+ keyword
+ @"(?!(<|\w))", "<span blah...>" + keyword + "</span>");
Edit: I fixed the "these" case, but not the case where more than the keyword is wrapped in HTML, e.g., "fox jumped over the lazy dog."
What you're asking for is going to be nearly impossible with RegEx and normal, everyday HTML, because to know if you're "inside" a tag, you would have to "pair" each start and end tag, and ignore tags that are intended to be self-closing (BR and IMG, for instance).
If this is merely eye candy for a web site, I suggest going the other route: fix your CSS so the SPAN you are adding only impacts the HTML outside of a tag.
For example:
content = content.Replace("the", "<span class=\"highlight\">the</span>");
Then, in your CSS:
span.highlight { background-color: yellow; }
b span.highlight,
i span.highlight,
em span.highlight,
strong span.highlight,
p span.highlight,
blockquote span.highlight { background: none; }
Just add an exclusion for each HTML tag whose contents should not be highlighted.
I like the suggestion to use an HTML parser, but let me propose a way to enumerate the top-level text (no enclosing tags) regions, which you can transform and recombine at your leisure.
Essentially, you can treat each top-level open tag as a {, and track the nesting of only that tag. This might be simple enough compared to regular parsing that you want to do it yourself.
Here are some potential gotchas:
If it's not XHTML, you need a list of tags which are always empty:
<hr> , <br> and <img> (are there more?).
For all opening tags, if it ends in />, it's immediately closed - {} rather than {.
Case insensitivity - I believe you'll want to match tag names insensitively (just lc them all).
Super-permissive generous browser interpretations like
"<p> <p>" = "<p> </p><p>" = {}{
Quoted entities are NOT allowed to contain <> (they need to use &lt;), but maybe browsers are super permissive there as well.
Essentially, if you want to parse correct HTML markup, there's no problem.
So, the algorithm:
"end of previous tag" = start of string
repeatedly search for the next open-tag (case insensitive), or end of string:
< *([^ >/]+)[^/>]*(/?) *>|$
handle (end of previous tag, start of match) as a region outside all tags.
set tagname=lc($1). if there was a / ($2 isn't empty), then update end and continue at start. else, with depth=1,
while depth > 0, scan for next (also case insensitive):
< *(/?) *$tagname *(/?) *>
If $1, then it's a close tag (depth-=1). Else if not $2, it's another open tag; depth+=1. In any case, keep looping (back to 1.)
Back to start (you're at top level again). Note that I said at the top "scan for next start of top-level open tag, or end of string", i.e. make sure you process the toplevel text hanging off the last closing tag.
That's it. Essentially, you get to ignore all other tags than the current topmost one you're monitoring, on the assumption that the input markup is properly nested (it will still work properly against some types of mis-nesting).
Also, wherever I wrote a space above, should probably be any whitespace (between < > / and tag name you're allowed any whitespace you like).
As you can see, just because the problem is slightly easier than full HTML parsing, doesn't necessarily mean you shouldn't use a real HTML parser :) There's a lot you could screw up.
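To make the algorithm above concrete, here is a rough Python sketch of it, combined with the whole-word replacement from the question. All names (VOID_TAGS, OPEN_TAG, top_level_regions, highlight) are invented for this illustration, the list of always-empty tags is deliberately incomplete, and the code assumes reasonably well-nested markup with no stray "<" in the text.

```python
import re

# Tags that never have a closing tag (an assumed, incomplete list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# An opening tag: name in group 1, optional self-closing slash in group 2.
OPEN_TAG = re.compile(r'<\s*([^\s>/]+)[^>]*?(/?)\s*>')


def top_level_regions(html):
    """Yield (start, end) spans of text that lies outside every tag pair."""
    pos = 0
    while True:
        m = OPEN_TAG.search(html, pos)
        if m is None:
            yield pos, len(html)  # text hanging off the last closing tag
            return
        yield pos, m.start()
        pos = m.end()
        name = m.group(1).lower()
        if m.group(2) or name in VOID_TAGS or name.startswith("!"):
            continue  # immediately closed: {} rather than {
        same_tag = re.compile(
            r'<\s*(/?)\s*%s\b[^>]*?(/?)\s*>' % re.escape(name), re.IGNORECASE)
        depth = 1
        while depth > 0:
            t = same_tag.search(html, pos)
            if t is None:
                return  # unbalanced markup: treat the rest as inside the tag
            pos = t.end()
            if t.group(1):
                depth -= 1   # closing tag
            elif not t.group(2):
                depth += 1   # another opening tag of the same name


def highlight(html, keyword, template='<span class="highlight">{0}</span>'):
    """Wrap whole-word matches of keyword, but only in top-level text."""
    word = re.compile(r'\b%s\b' % re.escape(keyword), re.IGNORECASE)
    out, last = [], 0
    for start, end in top_level_regions(html):
        out.append(html[last:start])  # tagged part, copied unchanged
        out.append(word.sub(lambda m: template.format(m.group(0)), html[start:end]))
        last = end
    out.append(html[last:])
    return "".join(out)
```

The \b word boundaries are what keep "there" from matching "the", and only the topmost tag's name is tracked, exactly as described above; everything between that tag and its matching close is copied through untouched.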
You'll need to give more details.
For example:
<p>the brown fox</p>
is technically inside HTML tags.