Breakdown of HTML RTF string for 3rd Party Formatting - c#

I have decided to come here with my problem as my head is fried and I have a deadline. My basic scenario is that on our system we save RTF HTML in the database, for example:
This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text
Which renders as follows:
This is Line 1 with more Bold and italic text
These HTML strings are exported to PDF and up until now the PDF renderer used could read and render this HTML correctly... Not any more. I am therefore having to do this the manual way and read each tag individually and apply the styling on the fly as I construct each paragraph. Fine.
My idea is to build a list of strings, for example:
"This is "
"<strong>Line 1</strong>"
" with more "
"<strong>Bold and <em>italic</em></strong>"
" text"
Each row either has an un-formatted string or contains all style tags for a given string.
I should then be able to build up my paragraph one string at a time, checking for tags and applying them when required.
I am however mentally failing at the first hurdle (Friday afternoon syndrome??) and cannot figure out how to build my list. I'm guessing I am going to use RegEx.
If someone is able to advise on how I might get a list like this, it would be greatly appreciated.
Edit
Following a Python example suggested below I have implemented the following, but this only gives me the elements surrounded by tags and none of the unformatted text:
var stringElements = Regex.Matches(paragraphString, @"(<(.*?)>.*?</\2>)", RegexOptions.Compiled)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();
So close...

I apologize up front, since my answer is written in Python, however I hope this provides you with some guidance.
import re
s = 'This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text'
matches = [i[0] for i in re.findall(r'(<(.*?)>.*?</\2>)', s)]
for i in matches:
    s = s.replace(i, '\n' + i + '\n')
print(s)
Gives:
This is
<strong> Line 1</strong>
with more
<strong>Bold and <em>italic</em></strong>
text
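To get the list the question asks for directly, one option is to walk the matches and also keep the text between them. This is only a minimal sketch: it assumes tags carry no attributes and that a tag is never nested inside another tag of the same name. The name split_segments is a hypothetical helper, and the pattern is a slightly stricter variant of the one above.

```python
import re

# Assumption: attribute-free tags, no <strong> nested inside another <strong>.
TAGGED = re.compile(r'<(\w+)>.*?</\1>')

def split_segments(html):
    """Return tagged and untagged pieces of html in document order."""
    segments = []
    pos = 0
    for match in TAGGED.finditer(html):
        if match.start() > pos:
            # plain text sitting between two tagged pieces
            segments.append(html[pos:match.start()])
        segments.append(match.group(0))
        pos = match.end()
    if pos < len(html):
        # trailing plain text after the last tagged piece
        segments.append(html[pos:])
    return segments

s = 'This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text'
for segment in split_segments(s):
    print(repr(segment))
```

The same loop should translate to C# with Regex.Matches and Match.Index / Match.Length.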

So I have found a solution by using the glorious Html Agility Pack:
var doc = new HtmlDocument();
doc.LoadHtml(paragraphString);
var htmlBody = doc.DocumentNode.SelectSingleNode(@"/p");
HtmlNodeCollection childNodes = htmlBody.ChildNodes;
List<string> elements = new List<string>();
foreach (var node in childNodes)
{
    elements.Add(node.OuterHtml);
}
As a note, I was previously removing the Paragraph tags surrounding the html from the paragraphString but have left them in for this example. So the string being passed in is actually:
<p>This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text</p>
I think the RegEx answer has some credibility and I am sure there is something in there that is just excluding the non 'noded' elements. This seems nicer though as you have access to the elements in a class-structure kinda way.
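Since the end goal is to apply styling while building each paragraph, it may be simpler to flatten the markup into runs of text plus the set of styles active for each run. Here is a rough Python sketch of that idea using the standard library's html.parser rather than Html Agility Pack. RunCollector, STYLE_TAGS and to_runs are hypothetical names, and only strong and em are treated as style tags here.

```python
from html.parser import HTMLParser

class RunCollector(HTMLParser):
    """Collects (text, active_style_tags) pairs in document order."""

    STYLE_TAGS = {"strong", "em"}  # assumption: the only styles we care about

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.active = []  # style tags currently open, outermost first
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in self.STYLE_TAGS:
            self.active.append(tag)

    def handle_endtag(self, tag):
        # close the most recently opened tag with this name, if there is one
        for i in range(len(self.active) - 1, -1, -1):
            if self.active[i] == tag:
                del self.active[i]
                break

    def handle_data(self, data):
        self.runs.append((data, frozenset(self.active)))

def to_runs(html):
    collector = RunCollector()
    collector.feed(html)
    collector.close()
    return collector.runs

html = '<p>This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text</p>'
for text, styles in to_runs(html):
    print(repr(text), sorted(styles))
```

Each run can then be handed to the PDF library with bold and/or italic switched on according to its style set.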

Related

Remove HTML from string -- comments

I have the following text which still contains some HTML code:
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
Hi There,
 
For the product team to have any chance in analysing this issue we need clarification on how to reproduce the problem.
My code at the moment is:
string replacedEmailText = Regex.Replace(emailText, @"<(.|\n)*?>", string.Empty);
string finalText = WebUtility.HtmlDecode(replacedEmailText);
How do I remove the top lines containing :
v\:* {behavior:url(#default#VML);}
?
For this specific example, you could use .*;}(\r\n|\r|\n)* as the pattern and replace every match with an empty string.
However, this will fail when the rest of the text contains the sequence ;}. If that is possible, you might want to be more specific about what the HTML lines look like:
.*\(#default#VML\);}(\r\n|\r|\n)*
Explanation:
.*: matches any character except a newline (\n), zero or more consecutive times
\(#default#VML\);}: matches the literal sequence (#default#VML);}
(\r\n|\r|\n)*: matches new line and carriage return characters zero or more consecutive times, so the line break is removed along with the line
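For anyone who wants to try the pattern outside .NET, here is a small Python sketch of the same substitution. The sample text is made up from the question, remove_vml_lines is a hypothetical helper name, and the non-capturing group is just a cosmetic change.

```python
import re

# Same idea as the pattern above: a whole line ending in (#default#VML);}
# plus any line breaks that follow it.
VML_LINE = re.compile(r'.*\(#default#VML\);\}(?:\r\n|\r|\n)*')

def remove_vml_lines(text):
    return VML_LINE.sub('', text)

email_text = (
    "v\\:* {behavior:url(#default#VML);}\r\n"
    "o\\:* {behavior:url(#default#VML);}\r\n"
    ".shape {behavior:url(#default#VML);}\r\n"
    "Hi There,\r\n"
)
print(repr(remove_vml_lines(email_text)))
```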
Do not try to strip HTML from text using regex, use some whitelisting library like https://github.com/mganss/HtmlSanitizer

HTML Strip Function

There is a tough nut to crack.
I have some HTML which needs to be stripped of certain tags, attributes AND properties.
Basically there are three different approaches which are to be considered:
String Operations: Iterate through the HTML string and strip it via string operations 'manually'
Regex: Parsing HTML with RegEx is evil. Is stripping HTML evil too?
Using a library to strip it (e.g. HTML Agility Pack)
My wish is that I have lists for:
acceptedTags (e.g. SPAN, DIV, OL, LI)
acceptedAttributes (e.g. STYLE, SRC)
acceptedProperties (e.g. TEXT-ALIGN, FONT-WEIGHT, COLOR, BACKGROUND-COLOR)
Which I can pass to this function which strips the HTML.
Example Input:
<BODY STYLE="font-family:Tahoma;font-size:11;"> <DIV STYLE="margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;font-family:tahoma;font-size:11;">Hello</SPAN></BODY>
Example Output (with parameter lists from above):
<DIV STYLE="text-align:Left;"> <SPAN STYLE="font-weight:bold;color:#000000;background-color:#FF0000;">Hello</SPAN>
the entire tag Body is stripped (not accepted tag)
properties margin, font-family and font-size are stripped from DIV-Tag
properties font-family and font-size are stripped from SPAN-Tag.
What have I tried?
Regex seemed to be the best approach at first glance, but I couldn't get it working properly.
Articles on Stackoverflow I had a look at:
Regular expression to remove HTML tags
How to clean HTML tags using C#
...and many more.
I tried the following regex:
Dim AcceptableTags As String = "font|span|html|i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
Dim WhiteListPattern As String = "</?(?(?=" & AcceptableTags & _
")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
Dim Html as String = Regex.Replace(b.HTML, WhiteListPattern, "", RegexOptions.Compiled)
However, this is only removing tags and no attributes or properties!
I'm definitely not looking for someone to do the whole job, just for someone who can point me in the right direction.
I'm happy with either C# or VB.NET as answers.
Definitely use a library! (See this)
With HTMLAgilityPack you can do pretty much everything you want:
Remove tags you don't want:
string[] allowedTags = {"SPAN", "DIV", "OL", "LI"};
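// "//*" selects elements only; "//node()" would also match text nodes (named "#text") and strip them
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//*"))
{
    if (!allowedTags.Contains(node.Name.ToUpper()))
    {
        HtmlNode parent = node.ParentNode;
        parent.RemoveChild(node, true); // true keeps the removed node's children in place
    }
}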
Remove attributes you don't want & remove properties
string[] allowedAttributes = { "STYLE", "SRC" };
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//node()"))
{
    List<HtmlAttribute> attributesToRemove = new List<HtmlAttribute>();
    foreach (HtmlAttribute att in node.Attributes)
    {
        if (!allowedAttributes.Contains(att.Name.ToUpper())) attributesToRemove.Add(att);
        else
        {
            string newAttrib = string.Empty;
            //do string manipulation based on your checking accepted properties
            //one way would be to split the attribute.Value by a semicolon and do a
            //String.Contains() on each one, not appending those that don't match. Maybe
            //use a StringBuilder instead too
            att.Value = newAttrib;
        }
    }
    foreach (HtmlAttribute attribute in attributesToRemove)
    {
        node.Attributes.Remove(attribute);
    }
}
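The string manipulation hinted at in the comments could look roughly like the following, written in Python for brevity. filter_style is a hypothetical helper, and the naive split on semicolons assumes no property value contains a semicolon itself (for example inside url(...) or a quoted string).

```python
def filter_style(style, accepted_properties):
    """Keep only the accepted 'name:value' declarations of a style attribute."""
    accepted = {name.lower() for name in accepted_properties}
    kept = []
    for declaration in style.split(";"):
        name, separator, value = declaration.partition(":")
        # skip empty fragments and anything that is not 'name:value'
        if separator and name.strip().lower() in accepted:
            kept.append(f"{name.strip()}:{value.strip()};")
    return "".join(kept)

accepted = ["TEXT-ALIGN", "FONT-WEIGHT", "COLOR", "BACKGROUND-COLOR"]
print(filter_style("margin:0 0 0 0;text-align:Left;font-family:Tahoma;font-size:16;", accepted))
```

Applied to the DIV and SPAN styles from the question, this should leave exactly the declarations shown in the example output.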
I would probably actually just write this myself as a multi-step process:
1) Exclude all rules for removing properties from tags that are listed as tags to be removed (the tags won't be there anyway!)
2) Walk the document, taking a copy of the document without excluded tags (i.e. in your example, copy everything up until "<div", then wait until I see ">" before continuing to copy). If I'm in copy mode and I see "ExcludedTag=", then stop copying until I see the closing quotation mark.
You'll probably want to do some pre-work validation on the html and getting the formatting the same, etc. before running this process to avoid broken output.
Oh, and copy in chunks, i.e. just keep the index of copy start until you reach copy end, then copy the whole chunk, not individual characters!
Hopefully that helps as a starting point.
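As a variation on the walk-and-copy idea, here is a sketch that lets a tolerant parser do the walking and only re-emits what is on the whitelists. It uses Python's standard html.parser instead of manual index tracking, so it is not the chunk-copy approach described above. WhitelistStripper and strip_html are hypothetical names, tag and attribute names come out lowercased, and style properties are not filtered here.

```python
from html import escape
from html.parser import HTMLParser

class WhitelistStripper(HTMLParser):
    """Re-emits the document, keeping only accepted tags and attributes."""

    def __init__(self, accepted_tags, accepted_attributes):
        super().__init__(convert_charrefs=False)
        self.accepted_tags = {t.lower() for t in accepted_tags}
        self.accepted_attributes = {a.lower() for a in accepted_attributes}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.accepted_tags:
            return  # drop the tag itself, its children are still visited
        parts = [tag]
        for name, value in attrs:
            if name in self.accepted_attributes:
                parts.append('{}="{}"'.format(name, escape(value or "", quote=True)))
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        if tag in self.accepted_tags:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&" + name + ";")  # pass entities through untouched

    def handle_charref(self, name):
        self.out.append("&#" + name + ";")

def strip_html(html, accepted_tags, accepted_attributes):
    stripper = WhitelistStripper(accepted_tags, accepted_attributes)
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.out)

sample = '<BODY STYLE="font-family:Tahoma;"><DIV STYLE="text-align:Left;" CLASS="x"><SPAN STYLE="color:#000000;">Hello</SPAN></DIV></BODY>'
print(strip_html(sample, ["SPAN", "DIV", "OL", "LI"], ["STYLE", "SRC"]))
```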

Removing HTML from a string

I have a table (a Wijmo Grid). The column Log takes some text.
The user is allowed to write HTML in the text, because the same text is also used when mailed to make it look pretty and well styled.
Let's say the text is:
var text = "Hello friend <br> How are you? <h1> from me </h1>";
Is there any method, such as JSON.stringify() or HTML.encode(), that I can/should use to get:
var textWithoutHtml = magic(text); // "Hello friend How are you? from me"
One of the problems is that if the text includes "<br>" it breaks to the next line in the row of the table, and it's possible to see the top half of the second line in the row, which doesn't look good.
var text = "Hello friend <br> How are you? <h1> from me </h1>";
var newText = text.replace(/(<([^>]+)>)/ig, "");
fiddle: http://jsfiddle.net/EfRs6/
As far as I understood your question, you can encode the values like this in C#:
string encodedValue= HttpUtility.HtmlEncode(txtInput.Text);
Note: here txtInput is the id of TextBox on your page.
You may try like this:
string s = Regex.Replace("Hello friend <br> How are you? <h1> from me </h1>", @"<[^>]+>|&nbsp;", "").Trim();
You can also check the HTML Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).
<[^>]+>|&nbsp;
1st Alternative: <[^>]+>
< matches the character < literally
[^>]+ matches a single character not present in the list (the list contains only >, case sensitive), between one and unlimited times, as many times as possible, giving back as needed [greedy]
> matches the character > literally
2nd Alternative: &nbsp;
&nbsp; matches the characters &nbsp; literally (case sensitive)
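For comparison, here is a small Python sketch of the same tag-stripping idea. It assumes the second alternative is the literal entity &nbsp;, and it deliberately differs from the snippets above in two ways: each match is replaced with a space rather than an empty string, and whitespace is collapsed afterwards, so that "a<br>b" does not become "ab". strip_tags is a hypothetical helper name.

```python
import re

TAG_OR_NBSP = re.compile(r'<[^>]+>|&nbsp;')

def strip_tags(text):
    """Drop tags and &nbsp; entities, then collapse runs of whitespace."""
    without_tags = TAG_OR_NBSP.sub(' ', text)
    return ' '.join(without_tags.split())

text = "Hello friend <br> How are you? <h1> from me </h1>"
print(strip_tags(text))
```

Like any regex approach it will be fooled by a ">" inside an attribute value, so it only suits simple, trusted markup.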

extracting just page text using HTMLAgilityPack

OK, so I am really new to the XPath queries used in HTMLAgilityPack.
So let's consider this page http://health.yahoo.net/articles/healthcare/what-your-favorite-flavor-says-about-you. What I want is to extract just the page content and nothing else.
So for that I first remove the script and style tags.
Document = new HtmlDocument();
Document.LoadHtml(page);
TempString = new StringBuilder();
foreach (HtmlNode style in Document.DocumentNode.Descendants("style").ToArray())
{
    style.Remove();
}
foreach (HtmlNode script in Document.DocumentNode.Descendants("script").ToArray())
{
    script.Remove();
}
After that I am trying to use //text() to get all the text nodes.
foreach (HtmlTextNode node in Document.DocumentNode.SelectNodes("//text()"))
{
    TempString.AppendLine(node.InnerText);
}
However, not only am I not getting just the text, I am also getting numerous \r \n characters.
Please, I need a little guidance in this regard.
If you consider that script and style nodes only have text nodes for children, you can use this XPath expression to get text nodes that are not in script or style tags, so that you don't need to remove the nodes beforehand:
//*[not(self::script or self::style)]/text()
You can further exclude text nodes that are only whitespace using XPath's normalize-space():
//*[not(self::script or self::style)]/text()[not(normalize-space(.)="")]
or the shorter
//*[not(self::script or self::style)]/text()[normalize-space()]
But you will still get text nodes that may have leading or trailing whitespace. This can be handled in your application as @aL3891 suggests.
If \r \n characters in the final string are the problem, you could just remove them after the fact:
TempString.ToString().Replace("\r", "").Replace("\n", "");
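The two ideas above (skip script/style content, drop whitespace-only text) can also be combined in a single pass. A rough sketch in Python with the standard html.parser, not Html Agility Pack; VisibleText and visible_text are hypothetical names, and whitespace inside each text node is collapsed to single spaces as an extra step.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collects text outside <script>/<style>, skipping whitespace-only nodes."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a script or style element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapses \r, \n and runs of spaces
        if text and not self.skip_depth:
            self.chunks.append(text)

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.chunks

page = ("<html><head><style>p{color:red}</style><script>var a = 1;</script></head>"
        "<body>\r\n<p> Hello \r\n world </p>\r\n<div>Bye</div></body></html>")
print(visible_text(page))
```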

How do I find a HTML div contains specific text after a text prefix?

I have following string:
<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4
and want to know whether it contains text3 inside divs that come after prefix:
prefix<div>...text3...</div>
but I don't know how to make a regex for that, since I can't use [^<]+ because divs can contain a strong tag inside.
Please help
EDIT:
Div tags after prefix are guaranteed not to be nested
Language is C#
Text4 is very long, so regex must not look after closing div
EDIT2: I don't want to use html parser, it can be easily (and MUCH faster) achieved with Regex. HTML there is simple: no attributes in tags; no nesting div's. And even some % of wrong answers are acceptable in my case.
If you turn off the "greedy" option, you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)
Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.
Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.
EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
this is my new regex:
prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>
seems to work ok.
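Here is a quick way to exercise that pattern, sketched in Python. One assumption is added: \s* between the prefix and <div>, because the sample string in the question has a space there. build_pattern and div_after_prefix_contains are hypothetical helper names.

```python
import re

def build_pattern(prefix, needle):
    # Anything inside the div that never crosses a closing </div>.
    inside = r'(?:[^<]*<(?!/div>))*[^<]*'
    return re.compile(
        re.escape(prefix) + r'\s*<div>' + inside + re.escape(needle) + inside + r'</div>'
    )

def div_after_prefix_contains(html, prefix, needle):
    return build_pattern(prefix, needle).search(html) is not None

sample = '<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4'
print(div_after_prefix_contains(sample, 'prefix', 'text3'))
```

Because the negative lookahead refuses to step over </div>, the match attempt stops at the end of the div and never scans the long text4 tail.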
For C# + HtmlAgilityPack you can do something like:
InputString = Regex.Replace(InputString,"^(?:[^<]+?|<[^>]*>)*?prefix","");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(InputString);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");
The prefix removal is still not a good way of dealing with it. Ideally you'd do something like using HtmlAgilityPack to find where prefix occurs in the DOM, translate that to provide the position in the string, then do a substring(pos,len) (or equivalent) to look at only the relevant text (you can also avoid looking at text4 using similar method).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
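The "find the position, then only look at the relevant substring" idea can at least be sketched with plain string searches. This is not the DOM-based version described above: it is written in Python, it relies on the question's guarantees (no attributes, no nested divs), and it only checks the first div that follows the prefix. first_div_after_prefix_contains is a hypothetical name.

```python
def first_div_after_prefix_contains(html, prefix, needle):
    """Check the first <div>...</div> after prefix without scanning past it."""
    start = html.find(prefix)
    if start == -1:
        return False
    open_at = html.find("<div>", start + len(prefix))
    if open_at == -1:
        return False
    close_at = html.find("</div>", open_at)
    if close_at == -1:
        return False
    # the search is bounded to the div's content, so the text4 tail is never read
    return html.find(needle, open_at + len("<div>"), close_at) != -1

sample = '<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4'
print(first_div_after_prefix_contains(sample, 'prefix', 'text3'))
```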
(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:
var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';
InputString = InputString.replace(/^.*?prefix/,'');
var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>')
console.log(MatchingDivs.get());
This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).
