Format/Safe string for "title" attribute in anchor - c#

I have a function that builds an anchor tag. The function receives the URL and the title as parameters. The problem is that sometimes the text includes quotation marks, and this results in an anchor tag generated with syntax errors.
What is the best way to solve this problem? Is there a function that converts the text into a safe string, in this case for the title attribute?
Otherwise I can check the string and strip all quotation marks, but I would like to know if there is a better way to do this, e.g. there might be other characters that can break my function as well.

Actually you want to use HttpUtility.HtmlAttributeEncode to encode your title attribute. The other encoders will do more work (and have different uses) whereas this one only escapes ", &, and < to generate a valid text for an attribute.
Example:
This is a <"test"> & something else. becomes This is a &lt;&quot;test&quot;> &amp; something else.
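As a minimal sketch of how the anchor-building function from the question might apply this, the following encodes both attribute values before concatenating them. The names AnchorBuilder and BuildAnchor and the text parameter are hypothetical, not taken from the question, and on .NET Framework the project is assumed to reference System.Web.

```csharp
using System;
using System.Web; // HttpUtility (assumes a System.Web reference on .NET Framework)

public static class AnchorBuilder
{
    // Hypothetical helper: every value is encoded for the context it lands in.
    public static string BuildAnchor(string url, string title, string text)
    {
        return "<a href=\"" + HttpUtility.HtmlAttributeEncode(url) + "\""
             + " title=\"" + HttpUtility.HtmlAttributeEncode(title) + "\">"
             + HttpUtility.HtmlEncode(text)   // element content uses the general encoder
             + "</a>";
    }
}
```

Encoding at the point where the markup is assembled means the caller never has to strip characters from the title by hand.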

Related

Standard token for encoding parentheses for HTML Ids?

I have to encode strings to remove parentheses for Ids for HTML elements.
Parentheses (these ones (,)) aren't valid in HTML Ids, are there standard strings (like those used in URLs) to use?
Is there an existing method that can be used in ASP.NET MVC?
N.B. System.Web.HttpUtility.HtmlEncode(string) does not encode parentheses.
As per the HTML specification (and this question about id's) parentheses aren't allowed in the HTML id attribute. If you need them, you could use string replace, e.g.:
// ( = 'op--' Opening Parenthesis
// ) = 'cp--' Closing Parenthesis
string id = "collectionName.get_Item(index)";
// encode
string encodedId = id.Replace("(", "op--").Replace(")", "cp--");
// decode
string decodedId = encodedId.Replace("op--", "(").Replace("cp--", ")");
I don't think I understand the question, cos it feels like the answer is to substitute [ and ]. Or even %28 and %29 from the Wikipedia link you gave.
Have I got hold of the wrong end of the stick?
EDIT: From what has been said in the comments, it seems that %28 and %29 are not okay as the % character is also invalid, in which case you could select a substitute that won't appear elsewhere.
E.g. something like ( becomes ---28--- (or even ---openbracket---) or something else you can guarantee won't appear elsewhere in the ID (which should be possible).
If the elements are dynamically created then why not just do a .Replace on the id changing parentheses for, say, underscores?
If the elements are not dynamically created then why do they have parentheses in the ids?!

Remove Encoded HTML from Strings using RegEx

I currently have an extension method for removing any HTML from strings.
Regex.Replace(s, @"<(.|\n)*?>", string.Empty);
This works fine on the whole, however, I am occasionally getting passed strings that have both standard HTML markup within them, along with encoded markup (I don't have control of the source data so can't correct things at the point of entry), e.g.
&lt;p&gt;<p>Sample text</p>&lt;/p&gt;
I need an expression that will remove both encoded and non-encoded HTML (whether it be paragraph tags, anchor tags, formatting tags etc.) from a string.
I think you can do that in two passes with your same Extension method.
First replace the usual un-encoded tags, then decode the returned string and do it again. Simple.
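A rough sketch of that two-pass idea, reusing the pattern from the question. The class and method names (HtmlStripper, StripHtml) are hypothetical, and WebUtility.HtmlDecode from System.Net is assumed to be available (.NET 4.0 or later); HttpUtility.HtmlDecode would serve the same purpose.

```csharp
using System;
using System.Net;                      // WebUtility.HtmlDecode
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Same pattern as in the question: lazily matches anything between < and >.
    private static readonly Regex Tag = new Regex(@"<(.|\n)*?>", RegexOptions.Compiled);

    public static string StripHtml(this string s)
    {
        string pass1 = Tag.Replace(s, string.Empty);     // pass 1: remove literal tags
        string decoded = WebUtility.HtmlDecode(pass1);   // turn &lt;p&gt; into <p>
        return Tag.Replace(decoded, string.Empty);       // pass 2: remove the decoded tags
    }
}
```

One caveat with this approach: decoding also turns any legitimately escaped &lt; in the text into <, so content that merely talks about tags may be stripped too.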

Remove anchor from URL in C#

I'm trying to pull in an src value from an XML document, and in one that I'm testing it with, the src is:
<content src="content/Orwell - 1984 - 0451524934_split_2.html#calibre_chapter_2"/>
That creates a problem when trying to open the file. I'm not sure what that #(stuff) suffix is called, so I had no luck searching for an answer. I'd just like a simple way to remove it if possible. I suppose I could write a function to search for a # and remove anything after, but that would break if the filename contained a # symbol (or can a file even have that symbol?)
Thanks!
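If you had the src in a string you could use
int hashIndex = srcstring.LastIndexOf('#');
string path = hashIndex >= 0 ? srcstring.Substring(0, hashIndex) : srcstring;
which would return the src without the # and everything after it (the check avoids an exception when there is no # at all). If the values you are retrieving are all web URLs then this should work; the # is a bookmark in a URL that takes you to a specific part of the page.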
You should be OK assuming that URLs won't contain a "#"
The character "#" is unsafe and should
always be encoded because it is used in World Wide Web and in other
systems to delimit a URL from a fragment/anchor identifier that might
follow it.
Source (search for "#" or "unsafe").
Therefore just use String.Split() with the "#" as the split character. This should give you 2 parts. In the highly unlikely event it gives more, just discard the last one and rejoin the remainder.
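The split-and-rejoin suggestion could look roughly like this. FragmentRemover and RemoveFragment are hypothetical names used only for illustration.

```csharp
using System;

public static class FragmentRemover
{
    public static string RemoveFragment(string src)
    {
        string[] parts = src.Split('#');
        if (parts.Length == 1)
        {
            return src;                                   // no fragment present
        }
        // Drop the last piece (the fragment) and rejoin whatever precedes it.
        return string.Join("#", parts, 0, parts.Length - 1);
    }
}
```

With the src from the question, this would leave content/Orwell - 1984 - 0451524934_split_2.html and discard calibre_chapter_2.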
From Wikipedia:
# is used in a URL of a webpage or other resource to introduce a "fragment identifier" – an id which defines a position within that resource. For example, in the URL http://en.wikipedia.org/wiki/Number_sign#Other_uses the portion after the # (Other_uses) is the fragment identifier, in this case indicating that the display should be moved to show the tag marked by ... in the HTML
It's not safe to remove the anchor from the URL. Ajax-style sites use the anchor to keep track of context, Gmail for example. If you go to http://www.gmail.com/#inbox, you go directly to your inbox, but if you go to http://www.gmail.com/#all, you'll go to all your mail.
What gets displayed can differ based on the anchor (the fragment is processed in the browser rather than sent to the server), even if the response is a file.

When's an Apostrophe not an Apostrophe - validation .Net / Javascript

I have a regular expression validator for emails in .NET 2.0 which uses client-side validation (JavaScript).
The current expression is "\w+([-+.']\w+)@\w+([-.]\w+).\w+([-.]\w+)" which works for my needs (or so I thought).
However, I was getting a problem with apostrophes, as I had copied and pasted an email address from Outlook into the form's text field:
Chris.O’Brian@somerandomdomain.com
You can see the apostrophe is a different character from what I get if I just type into a text box:
' vs ’ - but both are apostrophes
Okay, I thought, let's just add this character to the validation string, so I get
"\w+([-+.'’]\w+)@\w+([-.]\w+).\w+([-.]\w+)"
I copy and paste the "special" apostrophe into the validation expression, then I type the email and use the same clipboard item to paste the apostrophe, but the validation still fails.
The apostrophe doesn't look the same in the .NET code-behind file as in the .NET form, and because the validation is still failing, I presume it's being treated as a different character because of some sort of encoding of the .cs source file?
Does this sound plausible, has someone else encountered the same problem?
Thanks
You should add a '+' after ([-+.'`]\w+), to allow for multiple groups of 'words'. The expression you gave only allows for two words, and you have three: Chris, O, Brian.
Hope this makes things clearer.
There will be a tendency in something like Outlook to use 'Smart Quotes'
Here's some background information
If you just pasted the ’ (U+2019 RIGHT SINGLE QUOTATION MARK) into your document and it didn't work, it means that your document does not use Unicode.
When you encode and send the file as UTF-8 (for example) it works just fine without further modifications. Otherwise you have to escape it via \u2019, which also works in JavaScript's regular expressions:
"\w+([-+.'\u2019]\w+)@\w+([-.]\w+).\w+([-.]\w+)"
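To illustrate the escape on the .NET side, here is a small sketch. Note the assumptions: the * quantifiers, the escaped dot and the ^ / $ anchors are added here so the pattern can match a whole address and are not part of the expression as quoted in the thread, and ApostropheCheck / IsValidEmail are hypothetical names. Because \u2019 is written as an escape, the pattern no longer depends on how the source file is encoded.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ApostropheCheck
{
    // Assumption: quantifiers, escaped dot and anchors added for illustration.
    // \u2019 is interpreted by the regex engine, not by the source file encoding.
    private const string Pattern =
        @"^\w+([-+.'\u2019]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$";

    public static bool IsValidEmail(string email)
    {
        return Regex.IsMatch(email, Pattern);
    }
}
```

Both the typed apostrophe and the one pasted from Outlook are accepted by this character class, while other punctuation in that position is still rejected.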
In XML you could test the value of an apostrophe character by evaluating it against its character entity reference:
&apos;
That entity does not exist in the SGML form of HTML, however. And as an added bonus JavaScript cannot compare a single quote to a double quote. When compared they evaluated to true. The only solution there is to convert single quote and double quote characters to a character entity reference of your invention, perform the comparison, and then replace those invented entity references with the proper quote characters.

Convert > to HTML entity equivalent within HTML string

I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot@somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast@somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?
Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to properly encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup, hence if this is under your control you should consider dealing with it there.
The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
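For reference, the capture-and-reinsert technique can be wrapped up as below. This is only a sketch: StrayBracketEscaper and EscapeStrayGt are hypothetical names, and as noted above it makes no attempt to handle comments or CDATA sections.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrayBracketEscaper
{
    // \G ties each match to the end of the previous one, so the group always
    // consumes whole tags or runs of text before reaching a stray '>'.
    private static readonly Regex StrayGt =
        new Regex(@"\G((?>[^<>]+|<[^>]*>)*)>", RegexOptions.Compiled);

    public static string EscapeStrayGt(string html)
    {
        // $1 puts back the captured tags and text; only the stray '>' changes.
        return StrayGt.Replace(html, "$1&gt;");
    }
}
```

Because the closing > of each tag is consumed inside the group, only brackets that sit in text nodes reach the final > of the pattern and get replaced.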
Maybe read your HTML into an XML parser which should take care of the conversions for you.
Are you talking about the > chars inside of an HTML tag (like in JavaScript's innerText), or in the argument list of an HTML tag?
If you just want to sanitize the text between the opening and closing tags, that should be rather simple. Just locate any > char and replace it with &gt; (I'd also do the same for < with &lt;), but the HTML render engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry
Could you read the string into an XML document, look at the values, and replace the > with &gt; in the values? This would require recursively going into each node in the document, but that shouldn't be too hard to do.
Steve_C, you may try this RegEx. This will capture the name of any HTML tag in reference 1, and the text between the tags is stored in capture 2. I didn't fully test this, just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
