Html.Encode seems to simply call HttpUtility.HtmlEncode to replace a few html specific characters with their escape sequences.
However, this doesn't take into account how new lines and multiple spaces will be interpreted (markup whitespace). So if I provide a text area for a user to enter a plain text block of information, and then later display that data on another screen (using Html.Encode), the new lines and spacing will not be preserved.
I think there are 2 options, but maybe there is a better 3rd someone can suggest.
One option would be to just write a static method that uses HtmlEncode, and then replaces new lines in the resulting string with <br> and groups of multiple spaces with &nbsp;
Another option would be to mess about with the white-space: pre attribute in my style sheets - however I'm not sure if this would produce side effects when Html helper methods include new lines and tabbing to make the page source pretty.
Is there a third option, like a global flag, event or method override I can use to change how html encoding is done without having to redo the html helper methods?
HtmlEncode is only meant to encode characters for display in HTML. It specifically does not encode whitespace characters.
I would go with your first option, and make it an extension method for HtmlHelper. Something like:
public static string HtmlEncode(this HtmlHelper htmlHelper,
string text,
bool preserveWhitespace)
{
// ...
}
You could use String.Replace() to encode the newlines and spaces (or Regex.Replace if you need better matching).
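A minimal sketch of what the body of such a method could look like. To keep it self-contained it is written as a plain static method using System.Net.WebUtility.HtmlEncode rather than as an HtmlHelper extension; the class and method names are made up for illustration, and the exact replacement rules are an assumption about what "preserve whitespace" should mean:

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class WhitespaceEncoding
{
    // Hypothetical helper: HTML-encodes the text, then makes line breaks
    // and runs of spaces survive HTML whitespace collapsing.
    public static string EncodePreservingWhitespace(string text)
    {
        if (string.IsNullOrEmpty(text))
            return string.Empty;

        // Encode first, so any markup typed by the user is neutralised
        // before we add our own <br /> and &nbsp; markup.
        string encoded = WebUtility.HtmlEncode(text);

        // Cover \r\n, lone \r and lone \n, not just Environment.NewLine.
        encoded = Regex.Replace(encoded, @"\r\n|\r|\n", "<br />");

        // Turn every pair of spaces into "space + &nbsp;" so that runs of
        // spaces keep their width but the text can still wrap.
        return encoded.Replace("  ", " &nbsp;");
    }
}
```

With this, EncodePreservingWhitespace("a <b>\r\nc  d") gives a &lt;b&gt;<br />c &nbsp;d. In a view the result would still need wrapping in something like MvcHtmlString.Create so Razor does not encode it a second time.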
Using the style="white-space:pre-wrap;" worked for me. Per this article.
If you use Razor you can do:
@MvcHtmlString.Create(Html.Encode(strToEncode).Replace(Environment.NewLine, "<br />"))
in your view, or in your controller:
// HttpServerUtility has no public constructor, so use the static HttpUtility (System.Web) instead
MvcHtmlString encoded = MvcHtmlString.Create(
HttpUtility.HtmlEncode(strToEncode).Replace(Environment.NewLine, "<br />"));
I have not tested the controller method, but it should work the same way.
Put your output inside <pre></pre> and/or <code></code> blocks. E.g:
<pre>@someValue</pre> / <code>@someValue</code>
Use the equivalent css on an existing div:
<div style="white-space:pre-wrap;">@someValue</div>
Depends whether you want the semantic markup or whether you want to fiddle with css. I think these are both neater than inserting <br/> tags.
/// <summary>
/// Returns html string with new lines as br tags
/// </summary>
public static MvcHtmlString ConvertNewLinesToBr<TModel>(this HtmlHelper<TModel> html, string text)
{
return new MvcHtmlString(html.Encode(text).Replace(Environment.NewLine, "<br />"));
}
I have a controller which generates a string containing html markup. When it displays on views, it is displayed as a simple string containing all tags.
I tried to use an Html helper to encode/decode to display it properly, but it is not working.
string str= "seeker has applied to Job floated by you.</br>";
On my views,
@Html.Encode(str)
You are close; you want to use @Html.Raw(str).
@Html.Encode takes a string and ensures that all the special characters are handled properly, so characters such as <, > and & are rendered as literal text rather than interpreted as markup.
You should be using IHtmlString instead:
IHtmlString str = new HtmlString("seeker has applied to Job floated by you.</br>");
Whenever you have model properties or variables that need to hold HTML, I feel this is generally a better practice. First of all, it is a bit cleaner. For example:
@Html.Raw(str)
Compared to:
@str
I also think it's a bit safer than using @Html.Raw(), as the concern of whether your data is HTML is kept in your controller. In an environment where you have front-end vs. back-end developers, your back-end developers may be more in tune with what data can hold HTML values, thus keeping this concern in the back-end (controller).
I generally try to avoid using Html.Raw() whenever possible.
One other thing worth noting: I'm not sure where you're assigning str, but a few things concern me about how you may be implementing this.
First, this should be done in a controller, regardless of your solution (IHtmlString or Html.Raw). You should avoid any logic like this in your view, as it doesn't really belong there.
Additionally, you should be using your ViewModel for getting values to your view (and again, ideally using IHtmlString as the property type). Seeing something like @Html.Encode(str) is a little concerning, unless you were doing this just to simplify your example.
you can use
@Html.Raw(str)
See MSDN for more
Returns markup that is not HTML encoded.
This method wraps HTML markup using the IHtmlString class, which
renders unencoded HTML.
I had a similar problem with HTML input fields in MVC. The web page only showed the first word of the field.
Example: input field: "The quick brown fox" Displayed value: "The"
The resolution was to put the variable in quotes in the value statement as follows:
<input class="ParmInput" type="text" id="respondingRangerUnit" name="respondingRangerUnit"
onchange="validateInteger(this.value)" value="@ViewBag.respondingRangerUnit">
I had a similar problem recently, and google landed me here, so I put this answer here in case others land here as well, for completeness.
I noticed that when I had badly formatted HTML, all my HTML tags were being stripped out, with just the non-tag content remaining. In particular, I had a table with a missing opening table tag, and then all the HTML tags in the entire string were ripped out completely.
So, if the above doesn't work, and you're still scratching your head, then also check that your HTML is valid.
I notice even after I got it working, MVC was adding tbody tags where I had none. This tells me there is clean up happening (MVC 5), and that when it can't happen, it strips out all/some tags.
I want to have the following result. Username has to be bold:
Blabla Username Bla.
I have the format in a resource file:
Blabla {0} Bla.
And in the view I do the following:
@Html.FormatValue(User.Identity.Name, Resources.MyFormatString)
How can I make the Username bold and use Html.FormatValue? Or is there another method to achieve this?
You could simply change your resource to contain the bold-tag, strong-tag or a style.
Like "Blabla <b>{0}</b> Bla.".
[edit]
Indeed: I checked Html.FormatValue for escaping functionality and did not see any, but apparently it does escape its output :)
In that case using @Html.Raw and string.Format will work.
@Html.Raw(string.Format(Resources.MyFormatString, "SomeName"))
(tested in MVC 5, but @Html.Raw is also available in 4)
Also a small note: storing HTML in resources is probably not the best idea, mixing UI & content.
[/edit]
I wanted to solve your example with including html tags, be safe with html characters in the resources, and safely include user input or html tags. My solution of your example is
@(Resources.MyFormatString.FormatWithHtml(
"<b>" + HttpUtility.HtmlEncode(User.Identity.Name) + "</b>"))
using my function FormatWithHtml
/// Encodes to MvcHtmlString and includes HTML tags or already encoded strings, placeholder is the '|' character
public static MvcHtmlString FormatWithHtml (this string format, params string[] htmlIncludes)
{
var result = new StringBuilder();
int i = -1;
foreach(string part in format.Split('|')) {
result.Append(HttpUtility.HtmlEncode(part));
if (++i < htmlIncludes.Length)
result.Append(htmlIncludes[i]);
}
return new MvcHtmlString(result.ToString());
}
One more example, this
@("Resource is safe to html characters <&> and will include |format tags| or any | user input."
.FormatWithHtml("<b>", "</b>", "<b id='FromUser'>" +HttpUtility.HtmlEncode("<a href='crack.me'>click</a>") +"</b>"))
will write to your razor page
Resource is safe to html characters <&> and will include format tags or any <a href='crack.me'>click</a> user input.
I currently have an extension method from removing any HTML from strings.
Regex.Replace(s, @"<(.|\n)*?>", string.Empty);
This works fine on the whole, however, I am occasionally getting passed strings that have both standard HTML markup within them, along with encoded markup (I don't have control of the source data so can't correct things at the point of entry), e.g.
<p>&lt;p&gt;Sample text&lt;/p&gt;</p>
I need an expression that will remove both encoded and non-encoded HTML (whether it be paragraph tags, anchor tags, formatting tags etc.) from a string.
I think you can do that in two passes with your same Extension method.
First replace the usual un-encoded tags, then decode the returned string and run the same replacement again. Simple.
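A rough sketch of the two-pass idea, assuming the tag pattern from the question and using System.Net.WebUtility.HtmlDecode for the decoding step (the class and method names here are made up for illustration):

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripping
{
    // The tag pattern from the question.
    private static readonly Regex TagPattern = new Regex(@"<(.|\n)*?>");

    // Hypothetical helper: strip tags, decode entities, strip tags again.
    public static string StripHtmlTwice(string s)
    {
        if (string.IsNullOrEmpty(s))
            return string.Empty;

        // Pass 1: remove the ordinary, un-encoded tags.
        string withoutTags = TagPattern.Replace(s, string.Empty);

        // Decoding turns &lt;p&gt; into <p>, exposing the encoded tags.
        string decoded = WebUtility.HtmlDecode(withoutTags);

        // Pass 2: remove the tags that were encoded in the source data.
        return TagPattern.Replace(decoded, string.Empty);
    }
}
```

For a string that mixes a real paragraph tag with an encoded one around "Sample text", this returns just Sample text. One thing to watch: text that legitimately contains an encoded < followed later by an encoded > will lose everything in between on the second pass.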
I'm building a Ajax.ActionLink in C# which starts:
<%= Ajax.ActionLink("f lastname", ...more stuff
and I'd like there to be a new line character between the words "f" and "lastname". How can I accomplish this? I thought the special character was \n but that doesn't work, and <br> doesn't work either.
You might have to revert to doing something like:
f<br />last
And then wire in the Ajax bits manually.
Try this:
<%= Ajax.ActionLink("f<br />lastname", ...more stuff
You can't use <br /> because the ActionLink method (and indeed I believe all the html and ajax extension methods) encode the string. Thus, the output would be something like
f&lt;br /&gt;lastname
What you could try instead would be a formatting:
<%= string.Format(Ajax.ActionLink("f{0}lastname", ...more stuff), "<br />") %>
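To see the effect of that encoding in isolation, here is a small sketch. It uses System.Net.WebUtility.HtmlEncode as a stand-in for whatever the helper does internally (that substitution is an assumption; the class and method names are made up):

```csharp
using System;
using System.Net;

public static class LinkTextEncodingDemo
{
    // Hypothetical stand-in for the encoding a link helper applies
    // to its linkText argument before writing it into the <a> element.
    public static string EncodeLinkText(string linkText)
    {
        return WebUtility.HtmlEncode(linkText);
    }
}
```

EncodeLinkText("f<br />lastname") returns f&lt;br /&gt;lastname, which the browser then displays as the literal characters <br /> instead of breaking the line.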
Did you try the \r\n combination?
How about:
<%= Server.UrlDecode(Ajax.ActionLink(Server.UrlEncode("f<br/>lastname"), ...more stuff
This works for me -
<%= HttpUtility.HtmlDecode(Html.ActionLink("AOT <br/> Claim #", "actionName" ))%>
The \n used to work for me, but now it seems to be deprecated. Alternatively, you may use the Environment.NewLine property, for example:
string jay = "This is a" + Environment.NewLine + "multiline" + Environment.NewLine + "statement";
I think Andrew Hare's answer is correct. If you have slightly more complicated requirement, you do have the option to create your own AjaxHelper or HtmlHelper. This will involve creating custom extension methods that work on AjaxHelper and HtmlHelpers, by doing something like:
public static class CustomHtmlHelperExtensions
{
public static MvcHtmlString FormattedActionLink(this HtmlHelper html, ...)
{
var tagBuilder = new TagBuilder("a");
// TODO : Implementation here
// this syntax might not be exact but you get the gist of it!
return MvcHtmlString.Create(tagBuilder.ToString());
}
}
You can use dotPeek or your favorite .NET reflection tool to examine the standard extensions that come with ASP.NET MVC (e.g., ActionLink) to find out how Microsoft has implemented most of those extension methods. They have some pretty good patterns for writing those. In the past, I have taken this approach to simplify outputting HTML in a readable manner, such as for Google Maps or Bing Maps integration, for creating options like ActionImage, e.g., @Html.ActionImage(...), or to integrate outputting Textile-formatted HTML by enabling syntax such as @Html.Textile("textile formatted string").
If you define this in a separate assembly (like I do), then remember to include this into your project references and then add it to the project's Web.config as well.
Obviously, this approach is overkill for your specific purposes, and for this reason, my vote is for Andrew Hare's approach for your specific case.
It's been several years since the question was asked, but I had trouble with it. I found the answer to be (in MVC):
Text in your ActionLink: ...ActionLink("TextLine1" + Environment.NewLine + "TextLine2", ...
Give the ActionLink a class whose CSS rule contains this line:
white-space: pre;
That's it. I've seen answers where they put the entire ActionLink in <pre></pre> tags, but that caused more problems than it solved.
How do I use C# regular expression to replace/remove all HTML tags, including the angle brackets?
Can someone please help me with the code?
As often stated before, you should not use regular expressions to process XML or HTML documents. They do not perform very well with HTML and XML documents, because there is no way to express nested structures in a general way.
You could use the following.
String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
This will work for most cases, but there will be cases (for example CDATA containing angle brackets) where this will not work as expected.
The correct answer is don't do that, use the HTML Agility Pack.
Edited to add:
To shamelessly steal from the comment below by jesse, and to avoid being accused of inadequately answering the question after all this time, here's a simple, reliable snippet using the HTML Agility Pack that works with even most imperfectly formed, capricious bits of HTML:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(Properties.Resources.HtmlContents);
var text = doc.DocumentNode.SelectNodes("//body//text()").Select(node => node.InnerText);
StringBuilder output = new StringBuilder();
foreach (string line in text)
{
output.AppendLine(line);
}
string textOnly = HttpUtility.HtmlDecode(output.ToString());
There are very few defensible cases for using a regular expression for parsing HTML, as HTML can't be parsed correctly without a context-awareness that's very painful to provide even in a nontraditional regex engine. You can get part way there with a RegEx, but you'll need to do manual verifications.
Html Agility Pack can provide you a robust solution that will reduce the need to manually fix up the aberrations that can result from naively treating HTML as a context-free grammar.
A regular expression may get you mostly what you want most of the time, but it will fail on very common cases. If you can find a better/faster parser than HTML Agility Pack, go for it, but please don't subject the world to more broken HTML hackery.
The question is too broad to be answered definitively. Are you talking about removing all tags from a real-world HTML document, like a web page? If so, you would have to:
remove the <!DOCTYPE declaration or <?xml prolog if they exist
remove all SGML comments
remove the entire HEAD element
remove all SCRIPT and STYLE elements
do Grabthar-knows-what with FORM and TABLE elements
remove the remaining tags
remove the <![CDATA[ and ]]> sequences from CDATA sections but leave their contents alone
That's just off the top of my head--I'm sure there's more. Once you've done all that, you'll end up with words, sentences and paragraphs run together in some places, and big chunks of useless whitespace in others.
But, assuming you're working with just a fragment and you can get away with simply removing all tags, here's the regex I would use:
@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"
Matching single- and double-quoted strings in their own alternatives is sufficient to deal with the problem of angle brackets in attribute values. I don't see any need to explicitly match the attribute names and other stuff inside the tag, like the regex in Ryan's answer does; the first alternative handles all of that.
In case you're wondering about those (?>...) constructs, they're atomic groups. They make the regex a little more efficient, but more importantly, they prevent runaway backtracking, which is something you should always watch out for when you mix alternation and nested quantifiers as I've done. I don't really think that would be a problem here, but I know if I don't mention it, someone else will. ;-)
This regex isn't perfect, of course, but it's probably as good as you'll ever need.
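To make the point about angle brackets in attribute values concrete, here is a small comparison sketch. The class and method names are made up for illustration; the first pattern is the one above, the second is the simple <[^>]*> pattern for contrast:

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagRegexComparison
{
    // Quote-aware pattern: quoted attribute values are consumed whole,
    // so a '>' inside them does not end the tag early.
    private static readonly Regex QuoteAware =
        new Regex(@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>");

    // Naive pattern: stops at the first '>' no matter where it occurs.
    private static readonly Regex Naive = new Regex(@"<[^>]*>");

    public static string StripQuoteAware(string fragment)
    {
        return QuoteAware.Replace(fragment, string.Empty);
    }

    public static string StripNaive(string fragment)
    {
        return Naive.Replace(fragment, string.Empty);
    }
}
```

Given the fragment <a title="x > y">link</a>, StripQuoteAware returns link, while StripNaive cuts the opening tag short at the first > and leaves  y">link behind.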
Regex regex = new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", RegexOptions.Singleline);
Source
@JasonTrue is correct that stripping HTML tags should not be done via regular expressions.
It's quite simple to strip HTML tags using HtmlAgilityPack:
public string StripTags(string input) {
var doc = new HtmlDocument();
doc.LoadHtml(input ?? "");
return doc.DocumentNode.InnerText;
}
I would like to echo Jason's response though sometimes you need to naively parse some Html and pull out the text content.
I needed to do this with some Html which had been created by a rich text editor, always fun and games.
In this case you may need to remove the content of some tags as well as just the tags themselves.
In my case <xml> and <style> tags were thrown into the mix. Someone may find my (very slightly) less naive implementation a useful starting point.
/// <summary>
/// Removes all html tags from string and leaves only plain text
/// Removes content of <xml></xml> and <style></style> tags as aim to get text content not markup /meta data.
/// </summary>
/// <param name="input"></param>
/// <returns></returns>
public static string HtmlStrip(this string input)
{
input = Regex.Replace(input, "<style>(.|\n)*?</style>", string.Empty); // remove all <style></style> tags and anything in between
input = Regex.Replace(input, @"<xml>(.|\n)*?</xml>", string.Empty); // remove all <xml></xml> tags and anything in between
return Regex.Replace(input, @"<(.|\n)*?>", string.Empty); // remove any remaining tags but not their content: "<p>bob<span> johnson</span></p>" becomes "bob johnson"
}
try regular expression method at this URL: http://www.dotnetperls.com/remove-html-tags
/// <summary>
/// Remove HTML from string with Regex.
/// </summary>
public static string StripTagsRegex(string source)
{
return Regex.Replace(source, "<.*?>", string.Empty);
}
/// <summary>
/// Compiled regular expression for performance.
/// </summary>
static Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);
/// <summary>
/// Remove HTML from string with compiled Regex.
/// </summary>
public static string StripTagsRegexCompiled(string source)
{
return _htmlRegex.Replace(source, string.Empty);
}
use this..
@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"
Add .+? to <[^>]*> and try this regex (based on this):
<[^>].+?>
c# .net regex demo
Use this method to remove tags:
public string From_To(string text, string from, string to)
{
if (text == null)
return null;
string pattern = @"" + from + ".*?" + to;
Regex rx = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
MatchCollection matches = rx.Matches(text);
return matches.Count <= 0 ? text : matches.Cast<Match>().Where(match => !string.IsNullOrEmpty(match.Value)).Aggregate(text, (current, match) => current.Replace(match.Value, ""));
}