Removing HTML from a string - c#

I have a table (a Wijmo Grid). The column Log takes some text.
The user is allowed to write HTML in the text, because the same text is also used when mailed to make it look pretty and well styled.
Let's say the text is:
var text = "Hello friend <br> How are you? <h1> from me </h1>";
Is there any method, such as JSON.stringify() or HTML.encode(), that I can/should use to get:
var textWithoutHtml = magic(text); // "Hello friend How are you? from me"
One of the problems is that if the text includes "<br>", it breaks onto the next line in the table row, and the top half of the second line is then visible in the row, which doesn't look good.

var text = "Hello friend <br> How are you? <h1> from me </h1>";
var newText = text.replace(/(<([^>]+)>)/ig, "");
fiddle: http://jsfiddle.net/EfRs6/

As far as I understand your question, you can encode the values like this in C#:
string encodedValue= HttpUtility.HtmlEncode(txtInput.Text);
Note: here txtInput is the id of TextBox on your page.
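For comparison, here is a minimal Python sketch of what encoding does to the sample text. Python's html.escape is used only as a stand-in for HttpUtility.HtmlEncode, and the two are not guaranteed to agree on every character. Note that encoding keeps the tags and turns them into visible text; it does not remove them:

```python
import html

text = "Hello friend <br> How are you? <h1> from me </h1>"

# Encoding replaces the markup characters with entities, so a browser
# shows the tags as literal text instead of interpreting them.
encoded = html.escape(text)
print(encoded)
```

So encoding stops the <br> from breaking the row, but the user will see the raw tags in the cell.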

You may try like this:
string s = Regex.Replace("Hello friend <br> How are you? <h1> from me </h1>", @"<[^>]+>|&nbsp;", "").Trim();
You can also check the HTML Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).
Explanation of the pattern <[^>]+>|&nbsp;
1st Alternative: <[^>]+>
< matches the character < literally
[^>]+ matches a single character not present in the list below
Quantifier +: between one and unlimited times, as many times as possible, giving back as needed [greedy]
> a single character in the list: > literally (case sensitive)
> matches the character > literally
2nd Alternative: &nbsp;
&nbsp; matches the characters &nbsp; literally (case sensitive)
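A minimal Python sketch of the same idea, assuming the second alternative is meant to remove &nbsp; entities. The helper name strip_html is made up for illustration. It also collapses the runs of whitespace left behind by the removed tags, which the regex alone does not do:

```python
import re

# Matches any tag (<...>) or a literal &nbsp; entity.
TAG_OR_NBSP = re.compile(r"<[^>]+>|&nbsp;")

def strip_html(text):
    """Remove tags and &nbsp; entities, then collapse leftover whitespace."""
    without_tags = TAG_OR_NBSP.sub("", text)
    # split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so join() yields single spaces.
    return " ".join(without_tags.split())

print(strip_html("Hello friend <br> How are you? <h1> from me </h1>"))
```

This is fine for display in a grid cell, but like any regex approach it can be fooled by a > inside an attribute value or a comment.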

Related

Breakdown of HTML RTF string for 3rd Party Formatting

I have decided to come here with my problem as my head is fried and I have a deadline. My basic scenario is that on our system we save RTF HTML in the database, for example:
This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text
Which renders as following:
This is Line 1 with more Bold and italic text
These HTML strings are exported to PDF and up until now the PDF renderer used could read and render this HTML correctly... Not any more. I am therefore having to do this the manual way and read each tag individually and apply the styling on the fly as I construct each paragraph. Fine.
My idea is to build a list of strings, for example:
"This is "
"<strong>Line 1</strong>"
" with more "
"<strong>Bold and <em>italic</em></strong>"
" text"
Each row either has an un-formatted string or contains all style tags for a given string.
I should then be able to build up my paragraph one string at a time, checking for tags and applying them when required.
I am however mentally failing at the first hurdle (Friday afternoon syndrome??) and cannot figure out how to build my list. I'm guessing I am going to use RegEx.
If someone is able to advise on how I might get a list like this, it would be greatly appreciated.
Edit
Following a Python example suggested below I have implemented the following, but this only gives me the elements surrounded by tags and none of the unformatted text:
var stringElements = Regex.Matches(paragraphString, @"(<(.*?)>.*?</\2>)", RegexOptions.Compiled)
.Cast<Match>()
.Select(m => m.Value)
.ToList();
So close...
I apologize up front, since my answer is written in Python, however I hope this provides you with some guidance.
import re
s = 'This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text'
matches = [i[0] for i in re.findall(r'(<(.*?)>.*?</\2>)', s)]
for i in matches:
s = s.replace(i, '\n' + i + '\n')
print(s)
Gives:
This is
<strong> Line 1</strong>
with more
<strong>Bold and <em>italic</em></strong>
text
So I have found a solution by using the glorious Html Agility Pack:
var doc = new HtmlDocument();
doc.LoadHtml(paragraphString);
var htmlBody = doc.DocumentNode.SelectSingleNode(@"/p");
HtmlNodeCollection childNodes = htmlBody.ChildNodes;
List<string> elements = new List<string>();
foreach (var node in childNodes)
{
elements.Add(node.OuterHtml);
}
As a note, I was previously removing the Paragraph tags surrounding the html from the paragraphString but have left them in for this example. So the string being passed in is actually:
<p>This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text</p>
I think the RegEx answer has some credibility and I am sure there is something in there that is just excluding the non 'noded' elements. This seems nicer though as you have access to the elements in a class-structure kinda way.
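On the regex route, one way to keep the unformatted text as well is to walk the matches and also collect the gaps between them. A rough Python sketch, in the same spirit as the Python answer above (split_segments is a made-up name, and the pattern assumes tags without attributes and no element nested inside another element of the same name):

```python
import re

# An element whose closing tag has the same name as its opening tag.
TOP_LEVEL_ELEMENT = re.compile(r"<(\w+)>.*?</\1>")

def split_segments(html):
    """Split html into plain-text runs and whole top-level elements."""
    segments = []
    pos = 0
    for match in TOP_LEVEL_ELEMENT.finditer(html):
        if match.start() > pos:
            # Text between the previous element and this one.
            segments.append(html[pos:match.start()])
        segments.append(match.group(0))
        pos = match.end()
    if pos < len(html):
        # Trailing text after the last element.
        segments.append(html[pos:])
    return segments

s = "This is<strong> Line 1</strong> with more <strong>Bold and <em>italic</em></strong> text"
for segment in split_segments(s):
    print(repr(segment))
```

The same loop translates to C# by iterating over Regex.Matches and using Match.Index and Match.Length for the gaps.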

C# Regex pattern for finding a tag in a string

For the string below, I want to select only the inner script tag containing the URL http://cdn.walkme.com/users and replace the selected tag with an empty string. Can somebody help me with the regex pattern?
<script><script type="text/javascript">(function() {var walkme = document.createElement('script'); walkme.type = 'text/javascript'; walkme.async = true; walkme.src='http://cdn.walkme.com/users/cb643dab0d6f4c7cbc9d436e7c06f719/walkme_cb643dab0d6f4c7cbc9d436e7c06f719.js'; var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(walkme, s); window._walkmeConfig = {smartLoad:true}; })();</script></script>
I have tried this < script(.+)http://cdn.walkme.com/users/.+?\/script>
I agree that it's not really possible to have a comprehensive, generic regex that parses any (X)HTML the standard supports. That's true just by the nature of these things.
But you're perfectly fine to do lots of smaller cool tasks using Regex. Just like in your case, in order to strip particular script out of the page markup, you could just use the following regex to find an entry and then replace it with an empty string:
\<script\>\<script type="text/javascript"\>\(function\(\) \{var walkme =.*\</script\>
It does a very simple thing: it takes everything in between
<script><script type="text/javascript">(function() {var walkme =
(you can include more text to be more specific) and
</script>
Just ensure special symbols (like /, ( or )) are escaped properly.
Edited
To select only the inner tag, you need to use what is called a positive lookahead to find the closing tag that follows the opening one:
<script type="text/javascript">\(function\(\) {var walkme =.*(?=</script>)
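An alternative sketch in Python, not the pattern from this answer: it matches a whole script element whose body mentions the walkme URL and refuses to cross any other script tag on the way, so the outer empty <script></script> pair is left alone. The function name and the shortened EXAMPLE id in the sample are placeholders:

```python
import re

INNER_WALKME_SCRIPT = re.compile(
    r"<script\b[^>]*>"            # an opening script tag
    r"(?:(?!</?script\b).)*?"     # body that never crosses another script tag
    r"cdn\.walkme\.com/users/"    # the URL fragment being targeted
    r"(?:(?!</?script\b).)*?"     # rest of the body, same restriction
    r"</script>",                 # the closing tag of that same element
    re.IGNORECASE | re.DOTALL,
)

def remove_walkme_script(markup):
    """Delete script elements whose own body references the walkme URL."""
    return INNER_WALKME_SCRIPT.sub("", markup)

markup = (
    "<script><script type=\"text/javascript\">(function() {"
    "var walkme = document.createElement('script'); "
    "walkme.src='http://cdn.walkme.com/users/EXAMPLE/walkme_EXAMPLE.js'; "
    "})();</script></script>"
)
print(remove_walkme_script(markup))
```

The negative lookahead inside the repeated group is what keeps the match from starting at the outer tag; the pattern should carry over to .NET's Regex with RegexOptions.IgnoreCase and RegexOptions.Singleline.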

How do I find a HTML div contains specific text after a text prefix?

I have following string:
<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4
and want to know whether it contains text3 inside divs that come after prefix:
prefix<div>...text3...</div>
but I don't know how to make a regex for that, since I can't use [^<]+ because the divs can contain a strong tag inside.
Please help
EDIT:
Div tags after prefix are guaranteed not to be nested
Language is C#
Text4 is very long, so regex must not look after closing div
EDIT2: I don't want to use an HTML parser; this can be achieved easily (and MUCH faster) with a regex. The HTML there is simple: no attributes in tags, no nested divs. And even some % of wrong answers is acceptable in my case.
If you turn off the "greedy" option, you should be able to just use something like prefix<div>.*text3.*</div>. (If the <div> is allowed to have attributes, use prefix<div[^>]*>.*text3.*</div> instead.)
Numerous improvements could be made to this in order to take account of unusual spacing, >s within quotes, </div> within quotes, etc.
Patterns like prefix<div>...<div></div>text3</div> would be more difficult. You might have to capture all of the occurrences of the div tag so that you could count how many div tags were open at a given time.
EDIT: Oops, turning off the greedy option won't always give the right result, even in examples other than the one above. Probably better just to capture all occurrences of the div tag and go from there. As noted above by Peter, HTML is not a regular language and so you can't use regular expressions to do everything you might want with it.
this is my new regex:
prefix<div>([^<]*<(?!/div>))*[^<]*text3([^<]*<(?!/div>))*[^<]*</div>
seems to work ok.
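A quick Python check of that pattern against the sample string. The helper name is hypothetical, and one assumption is added: the sample has a space between prefix and <div>, so the sketch allows optional whitespace there:

```python
import re

def div_after_prefix_contains(html, prefix, needle):
    """True if a non-nested <div> right after prefix contains needle."""
    # Any text and tags, as long as no "<" starts a closing </div>.
    inside_div = r"(?:[^<]*<(?!/div>))*[^<]*"
    pattern = (
        re.escape(prefix) + r"\s*<div>"
        + inside_div + re.escape(needle)
        + inside_div + r"</div>"
    )
    return re.search(pattern, html) is not None

sample = "<div> text0 </div> prefix <div> text1 <strong>text2</strong> text3 </div> text4"
print(div_after_prefix_contains(sample, "prefix", "text3"))
print(div_after_prefix_contains(sample, "prefix", "text0"))
```

Because every repetition of the group has to consume a "<", the scan stops at the closing </div> and never wanders into the long text4 tail.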
For C# + HtmlAgilityPack you can do something like:
InputString = Regex.Replace(InputString,"^(?:[^<]+?|<[^>]*>)*?prefix","");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(InputString);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[contains(., 'text3')]");
The prefix removal is still not a good way of dealing with it. Ideally you'd do something like using HtmlAgilityPack to find where prefix occurs in the DOM, translate that to provide the position in the string, then do a substring(pos,len) (or equivalent) to look at only the relevant text (you can also avoid looking at text4 using similar method).
I'm afraid I can't translate all that into code right now; hopefully someone else can help there.
(original answer, before extra details provided)
Here is a JavaScript + jQuery solution:
var InputString = '<div>text0 </div> prefix <div>text1 <strong>text2</strong> text3 </div> text4';
InputString = InputString.replace(/^.*?prefix/,'');
var MatchingDivs = jQuery('div:contains(text3)','<div>'+InputString+'</div>')
console.log(MatchingDivs.get());
This makes use of jQuery's ability to accept a context as second argument (though it appears this needs to be wrapped in div tags to actually work).

Search and replace non-HTML content

I need to write some code that will search and replace whole words in a string that are outside HTML tags. So if I have this string:
string content = "the brown fox jumped over <b>the</b> lazy dog over there";
string keyword = "the";
I need to do something like:
if (content.ToLower().Contains(keyword.ToLower()))
content = content.Replace(keyword, String.Format("<span style=\"background-color:yellow;\">{0}</span>", keyword));
but I don't want to replace the "the" in the bold tags or the "the" in "there", just the first "the".
You can use this library to parse your HTML and replace only the words that are not inside any HTML tag. To replace only the word "the" and not "three", use Regex.Replace("the\s+"...) instead of string.Replace.
Try this:
content = Regex.Replace(content, "(?<!>)"
+ keyword
+ @"(?!(<|\w))", "<span blah...>" + keyword + "</span>");
Edit: I fixed the "these" case, but not the case where more than the keyword is wrapped in HTML, e.g., "fox jumped over the lazy dog."
What you're asking for is going to be nearly impossible with RegEx and normal, everyday HTML, because to know if you're "inside" a tag, you would have to "pair" each start and end tag, and ignore tags that are intended to be self-closing (BR and IMG, for instance).
If this is merely eye candy for a web site, I suggest going the other route: fix your CSS so the SPAN you are adding only impacts the HTML outside of a tag.
For example:
content = content.Replace("the", "<span class=\"highlight\">the</span>");
Then, in your CSS:
span.highlight { background-color: yellow; }
b span.highlight,
i span.highlight,
em span.highlight,
strong span.highlight,
p span.highlight,
blockquote span.highlight { background: none; }
Just add an exclusion for each HTML tag whose contents should not be highlighted.
I like the suggestion to use an HTML parser, but let me propose a way to enumerate the top-level text (no enclosing tags) regions, which you can transform and recombine at your leisure.
Essentially, you can treat each top-level open tag as a {, and track the nesting of only that tag. This might be simple enough compared to regular parsing that you want to do it yourself.
Here are some potential gotchas:
If it's not XHTML, you need a list of tags which are always empty:
<hr> , <br> and <img> (are there more?).
For all opening tags, if it ends in />, it's immediately closed - {} rather than {.
Case insensitivity - I believe you'll want to match tag names insensitively (just lc them all).
Super-permissive generous browser interpretations like
"<p> <p>" = "<p> </p><p>" = {}{
Quoted entities are NOT allowed to contain <> (they need to use &lt;), but maybe browsers are super permissive there as well.
Essentially, if you want to parse correct HTML markup, there's no problem.
So, the algorithm:
"end of previous tag" = start of string
repeatedly search for the next open-tag (case insensitive), or end of string:
< *([^ >/]+)[^/>]*(/?) *>|$
handle (end of previous tag, start of match) as a region outside all tags.
set tagname=lc($1). if there was a / ($2 isn't empty), then update end and continue at start. else, with depth=1,
while depth > 0, scan for next (also case insensitive):
< *(/?) *$tagname *(/?) *>
If $1, then it's a close tag (depth-=1). Else if not $2, it's another open tag; depth+=1. In any case, keep looping (back to 1.)
Back to start (you're at top level again). Note that I said at the top "scan for next start of top-level open tag, or end of string", i.e. make sure you process the toplevel text hanging off the last closing tag.
That's it. Essentially, you get to ignore all other tags than the current topmost one you're monitoring, on the assumption that the input markup is properly nested (it will still work properly against some types of mis-nesting).
Also, wherever I wrote a space above, should probably be any whitespace (between < > / and tag name you're allowed any whitespace you like).
As you can see, just because the problem is slightly easier than full HTML parsing, doesn't necessarily mean you shouldn't use a real HTML parser :) There's a lot you could screw up.
You'll need to give more details.
For example:
<p>the brown fox</p>
is technically inside HTML tags.

Convert > to HTML entity equivalent within HTML string

I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot#somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast#somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?
Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to properly encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup, so if this is under your control you should consider dealing with it there.
The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
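A Python sketch of the same "capture what isn't the target and put it back" idea. Python's re module has no \G anchor, so this uses a different technique from the pattern above: an alternation that matches whole tags first, with a replacement callback (the counterpart of a .NET MatchEvaluator) deciding what to emit. The function name is made up, and comments and CDATA are not handled:

```python
import re

# Either a complete tag (captured in group 1) or a lone ">".
TAG_OR_LONE_GT = re.compile(r"(<[^<>]*>)|>")

def escape_lone_gt(html):
    """Replace > with &gt; everywhere except where it closes a tag."""
    # group(1) is the tag text when a tag matched, None for a lone ">".
    return TAG_OR_LONE_GT.sub(lambda m: m.group(1) or "&gt;", html)

quoted = '<div class="quotedReply">> Ok.<br/>><br/>>> Please<br/></div>'
print(escape_lone_gt(quoted))
```

Because tags are consumed as a unit by the first alternative, the second alternative only ever sees the > characters that sit in text.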
Maybe read your HTML into an XML parser which should take care of the conversions for you.
Are you talking about the > characters inside an HTML tag (as in JavaScript's innerText), or in the argument list of an HTML tag?
If you just want to sanitize the text between the opening and closing tags, that should be rather simple. Just locate any > character and replace it with &gt;. (I'd also replace < with &lt;.) But the HTML rendering engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry
Could you read the string into an XML document, look at the values, and replace the > with &gt; in the values? This would require recursively going into each node in the document, but that shouldn't be too hard to do.
Steve_C, you may try this regex. It captures the name of any HTML tag in group 1 and the text between the opening and closing tags in group 2. I didn't fully test this; I'm just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
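A small Python check of what the two groups hold. The pattern needs a case-insensitive flag to match lowercase tag names, and the sample input is shortened from the question's test string with the class attribute written as class="quotedReply" (an assumption):

```python
import re

# Group 1: tag name. Group 2: everything up to the matching close tag.
TAG_PAIR = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

match = TAG_PAIR.search('<div class="quotedReply">> Ok, got it.</div>')
tag_name = match.group(1)
inner_text = match.group(2)
print(tag_name)
print(inner_text)
```

Once the inner text is isolated like this, the stray > in it can be replaced without touching the tag itself. Nested elements of the same name would still confuse the lazy (.*?) part.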
