Remove HTML from string -- comments - c#

I have the following text which still contains some HTML code:
v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
Hi There,
 
For the product team to have any chance in analysing this issue we need clarification on how to reproduce the problem.
My code at the moment is:
string replacedEmailText = Regex.Replace(emailText, @"<(.|\n)*?>", string.Empty);
string finalText = WebUtility.HtmlDecode(replacedEmailText);
How do I remove the top lines containing:
v\:* {behavior:url(#default#VML);}
?

For this specific example, you could match .*;}(\r\n|\r|\n)* and replace it with an empty string.
However, this will fail when the message text itself contains the sequence ;}. If that is possible, make the pattern more specific to what the leftover HTML lines look like:
.*\(#default#VML\);}(\r\n|\r|\n)*
Explanation:
.*: matches any character except a new line, zero or more consecutive times
\(#default#VML\);}: matches the literal sequence (#default#VML);}
(\r\n|\r|\n)*: matches trailing new line and carriage return characters, zero or more consecutive times
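To see the pattern at work, here is a minimal sketch in Python rather than C#. The sample text is an assumption that mirrors the question, and the names email_text and VML_LINE are made up for illustration. The pattern itself is the one above, so it should carry over to Regex.Replace with an empty replacement string.

```python
import re

# Assumed sample input: four leftover VML style lines, then the real message.
email_text = (
    "v\\:* {behavior:url(#default#VML);}\n"
    "o\\:* {behavior:url(#default#VML);}\n"
    "w\\:* {behavior:url(#default#VML);}\n"
    ".shape {behavior:url(#default#VML);}\n"
    "Hi There,\n"
    "we need clarification on how to reproduce the problem.\n"
)

# Same pattern as in the answer; "}" is escaped only for readability.
VML_LINE = re.compile(r".*\(#default#VML\);\}(\r\n|\r|\n)*")

# Each leftover line (including its line break) is replaced with nothing.
cleaned = VML_LINE.sub("", email_text)
print(cleaned)
```

Lines that do not end in (#default#VML);} are left alone, which is why the message body survives.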

Do not try to strip HTML from text using regex; use a whitelisting library such as https://github.com/mganss/HtmlSanitizer instead.
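To illustrate why a parser copes better than a regex here, below is a rough sketch using Python's standard html.parser as a stand-in. It is not the HtmlSanitizer library and does no whitelisting; it only shows that a parser knows which text belongs to a style block or a comment. The names TextExtractor and html_to_text, and the sample markup, are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the bodies of <style> and <script>."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped for free.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Assumed sample: the VML rules live inside a <style> element.
sample = (
    "<html><head><style>v\\:* {behavior:url(#default#VML);}</style></head>"
    "<body><p>Hi There,</p><!-- a comment --><p>Tom &amp; Jerry</p></body></html>"
)
plain = html_to_text(sample)
print(plain)
```

With this approach the VML rules never show up in the output, so there is nothing left to clean up afterwards.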

Related

Breakdown of HTML RTF string for 3rd Party Formatting

I have decided to come here with my problem as my head is fried and I have a deadline. My basic scenario is that on our system we save RTF HTML in the database, for example:
This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text
Which renders as following:
This is Line 1 with more Bold and italic text
These HTML strings are exported to PDF and up until now the PDF renderer used could read and render this HTML correctly... Not any more. I am therefore having to do this the manual way and read each tag individually and apply the styling on the fly as I construct each paragraph. Fine.
My idea is to build a list of strings, for example:
"This is "
"<strong>Line 1</strong>"
" with more "
"<strong>Bold and <em>italic</em></strong>"
" text"
Each row either has an un-formatted string or contains all style tags for a given string.
I should then be able to build up my paragraph one string at a time, checking for tags and applying them when required.
I am however mentally failing at the first hurdle (Friday afternoon syndrome??) and cannot figure out how to build my list. I'm guessing I am going to use RegEx.
If someone could advise on how I might get a list like this, it would be greatly appreciated.
Edit
Following a Python example suggested below I have implemented the following, but this only gives me the elements surrounded by tags and none of the unformatted text:
var stringElements = Regex.Matches(paragraphString, @"(<(.*?)>.*?</\2>)", RegexOptions.Compiled)
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();
So close...
I apologize up front, since my answer is written in Python, however I hope this provides you with some guidance.
import re
s = 'This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text'
matches = [i[0] for i in re.findall(r'(<(.*?)>.*?</\2>)', s)]
for i in matches:
    s = s.replace(i, '\n' + i + '\n')
print(s)
Gives:
This is
<strong> Line 1</strong>
with more
<strong>Bold and <em>italic</em></strong>
text
So I have found a solution by using the glorious Html Agility Pack:
var doc = new HtmlDocument();
doc.LoadHtml(paragraphString);
var htmlBody = doc.DocumentNode.SelectSingleNode(@"/p");
HtmlNodeCollection childNodes = htmlBody.ChildNodes;
List<string> elements = new List<string>();
foreach (var node in childNodes)
{
    elements.Add(node.OuterHtml);
}
As a note, I was previously removing the Paragraph tags surrounding the html from the paragraphString but have left them in for this example. So the string being passed in is actually:
<p>This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text</p>
I think the RegEx answer has some credibility and I am sure there is something in there that is just excluding the non 'noded' elements. This seems nicer though as you have access to the elements in a class-structure kinda way.
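For comparison, the same "walk the top-level nodes" idea can be sketched in Python with the standard html.parser instead of Html Agility Pack. This is only an illustration under assumptions: the fragment is well formed, has no void elements such as <br>, and the names TopLevelSplitter and split_top_level are made up here.

```python
from html.parser import HTMLParser


class TopLevelSplitter(HTMLParser):
    """Splits a fragment into top-level chunks: bare text or a whole element."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._depth = 0      # how many elements are currently open
        self._current = []   # pieces of the element being collected

    def handle_starttag(self, tag, attrs):
        self._current.append(self.get_starttag_text())
        self._depth += 1

    def handle_endtag(self, tag):
        self._current.append(f"</{tag}>")
        self._depth -= 1
        if self._depth == 0:
            # A top-level element just closed: emit it as one chunk.
            self.chunks.append("".join(self._current))
            self._current = []

    def handle_data(self, data):
        if self._depth == 0:
            self.chunks.append(data)   # unformatted text between elements
        else:
            self._current.append(data)


def split_top_level(fragment):
    splitter = TopLevelSplitter()
    splitter.feed(fragment)
    splitter.close()
    return splitter.chunks


s = "This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text"
chunks = split_top_level(s)
for chunk in chunks:
    print(repr(chunk))
```

Unlike the regex attempt, the unformatted text runs come out as their own entries, because the parser reports them as text nodes.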

Removing HTML from a string

I have a table (a Wijmo Grid). The column Log takes some text.
The user is allowed to write HTML in the text, because the same text is also used when mailed to make it look pretty and well styled.
Let's say the text is:
var text = "Hello friend <br> How are you? <h1> from me </h1>";
Is there any method, such as JSON.stringify() or HTML.encode(), that I can/should use to get:
var textWithoutHtml = magic(text); // "Hello friend How are you? from me"
One of the problems is that if the text includes "<br>", it breaks to the next line in the table row, and the top half of the second line becomes visible in the row, which doesn't look good.
var text = "Hello friend <br> How are you? <h1> from me </h1>";
var newText = text.replace(/(<([^>]+)>)/ig, "");
fiddle: http://jsfiddle.net/EfRs6/
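The same tag-stripping pattern, sketched in Python as an illustration. The helper name strip_tags is an assumption, and replacing each tag with a space before collapsing whitespace is a choice made here (so that words on either side of a tag are not glued together); it is not part of the answer above.

```python
import re

TAG = re.compile(r"<[^>]+>")


def strip_tags(text):
    # Replace each tag with a space, then collapse runs of whitespace.
    without_tags = TAG.sub(" ", text)
    return " ".join(without_tags.split())


text = "Hello friend <br> How are you? <h1> from me </h1>"
text_without_html = strip_tags(text)
print(text_without_html)  # Hello friend How are you? from me
```

As with any regex-based stripping, this is only reasonable for simple, trusted markup; a literal < or > inside the text would confuse it.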
As far as I understand your question, you can encode the values like this in C#:
string encodedValue= HttpUtility.HtmlEncode(txtInput.Text);
Note: here txtInput is the id of TextBox on your page.
You may try like this:
string s = Regex.Replace("Hello friend <br> How are you? <h1> from me </h1>", @"<[^>]+>|&nbsp;", "").Trim();
You can also check the HTML Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).
<[^>]+>|&nbsp;
1st Alternative: <[^>]+>
< matches the character < literally
[^>]+ matches a single character not present in the list below
Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
> a single character in the list > literally (case sensitive)
> matches the character > literally
2nd Alternative: &nbsp;
&nbsp; matches the characters &nbsp; literally (case sensitive)

Ignoring &nbsp; when parsing with HtmlAgilityPack

I'm parsing an HTML table that contains non-breaking spaces in C# using Html Agility Pack.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(page);
Where page is a string containing a table with the special character &nbsp; within the text.
<td>&nbsp;test</td>
<td>number = 123&nbsp;</td>
Using SelectSingleNode(".//td").InnerText will return text that contains these special characters, but I want to ignore them.
Is there some elegant way to ignore this (with or without help of Html Agility Pack) without modifying source table?
You could use HtmlDecode
string foo = HttpUtility.HtmlDecode("Special char: &nbsp;");
Will give you a string:
Special char:
The "Special Character" non-breaking-space of which you speak is a valid character which can perfectly legitimately appear in text, just as "fancy quotes", em-dash etc can.
Often we want to treat certain characters as being equivalent.
So you might want to treat an em-dash, en-dash and minus sign/dash as
being the same.
Or fancy quotes as the same as straight quotes.
Or the non-breaking-space as an ordinary space.
However this is not something HTML Agility pack can help with. You need to use something like string.Replace or your own canonicalization function to do this.
I would suggest something like:
static string CleanupStringForMyApp(string s)
{
    // replace characters with their equivalents
    // U+00A0 is the non-breaking space
    s = s.Replace('\u00A0', ' ');
    // Add any more replacements you want to do here
    return s;
}
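The canonicalization idea extends naturally to the other equivalences mentioned above. Here is a rough sketch in Python; the mapping table and the names EQUIVALENTS and canonicalize are assumptions for illustration, and which characters count as "equivalent" is an application decision.

```python
import html

# Assumed equivalence table: each key is replaced by its plain counterpart.
EQUIVALENTS = str.maketrans({
    "\u00a0": " ",   # non-breaking space -> ordinary space
    "\u2013": "-",   # en-dash -> hyphen-minus
    "\u2014": "-",   # em-dash -> hyphen-minus
    "\u2018": "'",   # left single quote -> straight quote
    "\u2019": "'",   # right single quote -> straight quote
    "\u201c": '"',   # left double quote -> straight quote
    "\u201d": '"',   # right double quote -> straight quote
})


def canonicalize(text):
    return text.translate(EQUIVALENTS)


# Decoding the entity first yields the real U+00A0 character,
# which the table then maps to an ordinary space.
decoded = html.unescape("number = 123&nbsp;")
cleaned = canonicalize(decoded).strip()
print(cleaned)  # number = 123
```

Keeping the table in one place makes it easy to add more replacements later without touching the calling code.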

Remove text enclosed in a div tag using C# Regex

I have a string as follows:
string chart = "<div id=\"divOne\">Label.</div>;" which is generated dynamically without my control and would like to remove the text "Label." from the enclosing div element.
I tried the following, but my regex knowledge is still too limited to get it working:
System.Text.RegularExpressions.Regex.Replace(chart, @"/(<div[^>]+>)[^<]+(<\/div>)/i", "");
Using LinqPad I got this snippet working. Hopefully it solves your problem correctly.
string chart = "<div id=\"divOne\">Label.</div>;";
var regex = new System.Text.RegularExpressions.Regex(@">.*<");
var result = regex.Replace(chart, "><");
result.Dump(); // prints <div id="divOne"></div>
Essentially, it finds all characters between the opposing angle brackets, and replaces it.
The approach you take depends on how robust the replacement needs to be. If you're using this at a more general level where you want to target the specific node, you should use a MatchEvaluator. This example produces a similar result:
string pattern = @"<(?<element>\w*) (?<attrs>.*)>(?<contents>.*)</(?<elementClose>.*>)";
var x = System.Text.RegularExpressions
.Regex.Replace(chart, pattern, m => m.Value.Replace(m.Groups["contents"].Value, ""));
The pattern you use in this case is customizable, but it takes advantage of named group captures. It allows you to isolate portions of the match, and refer to them by name.
Your regex looks good to me, (but don't specify the '/.../i' delimiters and modifier). And use '$1$2' as your replacement string:
var regex = new System.Text.RegularExpressions.Regex(@"(?i)(<div[^>]+>)[^<]+(<\/div>)");
var result = regex.Replace(chart, "$1$2");
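The "keep groups 1 and 2, drop what sits between them" idea can be checked quickly in Python, where the replacement is written \1\2 instead of $1$2. This is a sketch only; the variable names are assumptions and the sample string is the one from the question.

```python
import re

chart = '<div id="divOne">Label.</div>;'

# Group 1 is the opening tag, group 2 the closing tag;
# the text between them is matched but not captured.
DIV_TEXT = re.compile(r"(<div[^>]+>)[^<]+(</div>)", re.IGNORECASE)

result = DIV_TEXT.sub(r"\1\2", chart)
print(result)  # <div id="divOne"></div>;
```

Note that [^>]+ requires at least one character after "<div", so a bare <div> without attributes would not match this particular pattern.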
Try this for your regex:
<div\b[^>]*>(.*?)<\/div>
The following produces the output <div></div>
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"<div\b[^>]*>(.*?)<\/div>");
Console.WriteLine(regex.Replace("<div>Label 1.</div>","<div></div>"));
Console.ReadLine();
You must just write a pattern to select text in the div tag.
Regex.Replace(chart, yourPattern, string.Empty);
I'm a little confused by your question; it sounds like you are parsing through some pre-generated HTML and want to remove all instances of the value of chart that occur within a <div> tag. If that's correct, try this:
"(<div[^>]*>[^<]*)"+chart+"([^<]*</div>)"
Return the first & second groupings concatenated together and you should have your <div> back sans chart.
Here is a better way than Regex.
var element = XElement.Parse("<div id=\"divOne\">Label.</div>");
element.Value = "";
var value = element.ToString();
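The parser-based approach has a close analogue in Python's xml.etree.ElementTree, sketched here for comparison. It assumes, like the XElement version, that the fragment is well-formed XML. Passing short_empty_elements=False is a choice made in this sketch so that the empty element is not serialized in self-closing form.

```python
import xml.etree.ElementTree as ET

# Parse the fragment, clear its text, and serialize it again.
element = ET.fromstring('<div id="divOne">Label.</div>')
element.text = ""

value = ET.tostring(element, encoding="unicode", short_empty_elements=False)
print(value)  # <div id="divOne"></div>
```

Because the element is addressed as a node rather than as a run of characters, attribute values containing > or < cannot trip this up.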
RegEx match open tags except XHTML self-contained tags

Convert > to HTML entity equivalent within HTML string

I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot#somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast#somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?
Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to properly encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup; if this is under your control, you should consider dealing with it there.
The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
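A related way to leave tags untouched can be sketched in Python. Python's re module has no \G anchor, so this sketch swaps in a different technique: match whole tags and stray > characters in one alternation, and rewrite only the stray ones in a callback. The names TAG_OR_GT and escape_stray_gt are assumptions, and the caveats about comments and CDATA apply here as well.

```python
import re

# First alternative consumes a complete tag; second matches a lone ">".
TAG_OR_GT = re.compile(r"<[^>]*>|>")


def escape_stray_gt(html_text):
    def replace(match):
        token = match.group(0)
        # A lone ">" is text content; anything else is a tag and is kept as is.
        return "&gt;" if token == ">" else token

    return TAG_OR_GT.sub(replace, html_text)


sample = '<div class="quotedReply">> Ok, got it.<br/>><br/>>> Please reply.<br/></div>'
escaped = escape_stray_gt(sample)
print(escaped)
```

Because the tag alternative is tried first at each position, the > that closes a tag is always consumed as part of that tag and never reaches the second alternative.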
Maybe read your HTML into an XML parser which should take care of the conversions for you.
Are you talking about the > chars inside of an HTML tag (like in JavaScript's innerText), or in the arguments list of an HTML tag?
If you just want to sanitize the text between the opening and closing tags, that should be rather simple. Just locate any > char and replace it with &gt; (I'd also replace < with &lt;), but the HTML render engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry
Could you read the string into an XML document, look at the values, and replace the > with &gt; in the values? This would require recursively going into each node in the document, but that shouldn't be too hard to do.
Steve_C, you may try this RegEx. It captures the tag name of any HTML element in group 1, and the text between the tags in group 2. I didn't fully test this; just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
