When's an Apostrophe not an Apostrophe - validation .Net / Javascript

I have a regular expression validator for emails in .NET 2.0 which uses client-side validation (JavaScript).
The current expression is "\w+([-+.']\w+)@\w+([-.]\w+).\w+([-.]\w+)", which works for my needs (or so I thought).
However, I ran into a problem with apostrophes when I copied and pasted an email address from Outlook into the form's text field:
Chris.O’Brian@somerandomdomain.com
You can see the apostrophe is a different character from what I get if I just type into a text box:
' vs ’ - but both are apostrophes
Okay, I thought, let's just add this character to the validation string, so I get:
"\w+([-+.'’]\w+)@\w+([-.]\w+).\w+([-.]\w+)"
I copy and paste the "special" apostrophe into the validation expression, then I type the email and use the same clipboard item to paste the apostrophe, but the validation still fails.
The apostrophe doesn't look the same in the .NET code-behind file as in the .NET form, and because the validation is still failing, I presume it's being treated as a different character because of some sort of encoding of the .cs source file?
Does this sound plausible? Has anyone else encountered the same problem?
Thanks

You should add a '+' after ([-+.'`]\w+), to allow for multiple groups of 'words'. The expression you gave only allows for two words, and you have three: Chris, O, Brian.
Hope this makes things clearer.

There will be a tendency in something like Outlook to use 'Smart Quotes'
Here's some background information

If you just pasted the ’ (U+2019 RIGHT SINGLE QUOTATION MARK) into your document and it didn't work, it means that your document is not saved in a Unicode encoding.
When you encode and send the file as UTF-8 (for example), it works just fine without further modifications. Otherwise you have to escape it as \u2019, which also works in JavaScript's regular expressions:
"\w+([-+.'\u2019]\w+)@\w+([-.]\w+).\w+([-.]\w+)"
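A minimal JavaScript sketch of the escape approach, in case it helps to see it run. The pattern below is an assumption, not the asker's exact validator: it adds `*` quantifiers, an escaped dot, and `^`/`$` anchors to the expression from the question. Because the curly apostrophe is written as `\u2019`, the result does not depend on how the source file is encoded.

```javascript
// Assumed pattern: the question's expression plus "*" quantifiers, an escaped
// dot, and ^/$ anchors. The curly apostrophe is written as the escape \u2019.
const emailPattern = /^\w+([-+.'\u2019]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$/;

// Same pattern without \u2019, for comparison.
const asciiOnlyPattern = /^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$/;

const typed = "Chris.O'Brian@somerandomdomain.com";       // U+0027 APOSTROPHE
const pasted = "Chris.O\u2019Brian@somerandomdomain.com"; // U+2019 as pasted from Outlook

console.log(emailPattern.test(typed));      // true
console.log(emailPattern.test(pasted));     // true
console.log(asciiOnlyPattern.test(pasted)); // false

// The two "apostrophes" really are different code points.
console.log("'".charCodeAt(0).toString(16));      // "27"
console.log("\u2019".charCodeAt(0).toString(16)); // "2019"
```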

In XML you could test the value of an apostrophe character by evaluating it against its character entity reference:
&apos;
That entity does not exist in the SGML form of HTML, however. And as an added bonus JavaScript cannot compare a single quote to a double quote. When compared they evaluated to true. The only solution there is to convert single quote and double quote characters to a character entity reference of your invention, perform the comparison, and then replace those invented entity references with the proper quote characters.

Related

Converting "bad" characters to their equivalent without a direct string.Replace and a list

I have done my research and everything I've found either does nothing or is too Leeroy Jenkins and replaces everything else that it shouldn't. It's possible that I'm phrasing everything wrong in my search and so coming up with nothing.
I have to replace all the wrong characters that rich text programs (and older programs) autocorrect for the user, because the user then copies and pastes directly into a web form.
For example, the "funky" apostrophe (’) converted to the regular apostrophe (') and the quotation marks and everything else.
I've tried UTF en/decoding, diacritic removal (not at all what I need), and a direct brute force string.Replace isn't reasonable, really.
Here's some example text that has all the bad stuff:
"They’re taking the hobbits to Isengaurd with bad apostrophe’s instead of good one's. Itâ€™s just how they roll."
Note that the only good apostrophe is in one's, and that I already have one rendered result of this (Itâ€™s), so I need to convert it back (along with all the other baddies) without a string.Replace and a list of characters to watch for.
What ought I be doing here?
To clarify: I need to convert the bad characters to good equivalents before data is submitted AND I need to catch existing stuff that was rendered after it was saved. So I need to do two things here.

Single regex expression to decode CSV with embedded double quotes and commas

I have lots of CSV data that I am trying to decode using regex. I am actually trying to build on an existing code base that other people/projects hit, and I don't want to risk breaking their data flows by refactoring the class too much. So, I was wondering if it is possible to decode this text with a single regex (which is how the class works currently):
f1,f2,f3,f4,f5,f6,f7
,"clean text","with,embedded,commas.","with""embedded""double""quotes",,"6.1",
The first row is the header. If I save this as xxx.csv and open it in Excel, it parses it properly to read (note that the spaces between the fields are the cell breaks):
f1 f2 f3 f4 f5 f6 f7
clean text with,embedded,commas. with"embedded"double"quotes 6.1
But when I try this in .net, I get stuck on the regex. I have this:
string regExp = "(((?<x>(?=[,\\r\\n]+))|\"(?<x>([^\"]|\"\")+)\"|(?<x>[^,\\r\\n]+)),?)";
You can see it in action here:
http://ideone.com/hRq8xe
Which results in this:
<start>
clean text
with,embedded,commas.
with""embedded""double""quotes
6.1
<end>
This is very close but it does not replace the escaped double-double quotes with a single-double quote like Excel does. I could not come up with a regex that worked better. Can it be done?

Maybe you can somehow manage to match your string using regular expression conditionals with the following constructs:
if-then statement: (?(?=regex)then|else)
multiple if-then statements: (?(?=condition)(then1|then2|then3)|(else1|else2|else3))
I came up with the following pattern in order to match the body of your text: ([^\,]+(?(?=[^\,])([^\"]+")|([^\,]+,))). However, you will need to put in extra effort to create a completely matching expression for your text, or end up using a file parser. If so, you can take a look at FileHelpers, a pretty neat library for parsing text files.
Sources:
Regular Expression Conditionals
Alternation Constructs in Regular Expressions
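As a rough illustration of the alternative, here is a sketch in JavaScript (not C#, and not the asker's existing class) that keeps a single regex for splitting but collapses the doubled quotes in a separate step after the match. The function name `parseCsvLine` is made up for this example, and the sketch handles one record per call with no embedded line breaks.

```javascript
// Hypothetical helper: splits one CSV record into an array of field values.
function parseCsvLine(line) {
  const fields = [];
  // A field is either quoted (with "" standing for one quote) or an unquoted
  // run without commas; it is followed by a comma or the end of the line.
  const fieldPattern = /(?:"((?:[^"]|"")*)"|([^,]*))(,|$)/g;
  let match;
  while ((match = fieldPattern.exec(line)) !== null) {
    const quoted = match[1];
    // Collapse the escaped "" pairs only for quoted fields.
    fields.push(quoted !== undefined ? quoted.replace(/""/g, '"') : match[2]);
    if (match[3] === "") break; // matched the end of the line
  }
  return fields;
}

const row = ',"clean text","with,embedded,commas.","with""embedded""double""quotes",,"6.1",';
console.log(parseCsvLine(row));
```

The regex alone cannot rewrite `""` to `"` inside a capture, which is why the unescaping is done on the captured value afterwards.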

How do you correctly escape a document name in .NET?

We store a bunch of weird document names on our web server (people upload them) that contain various characters like spaces, ampersands, etc. When we generate links to these documents, we need to escape them so the server can look up the file by its raw name in the database. However, none of the built-in .NET escape functions work correctly in all cases.
Take the document Hello#There.docx:
UrlEncode will handle this correctly:
HttpUtility.UrlEncode("Hello#There");
"Hello%23There"
However, UrlEncode will not handle Hello There.docx correctly:
HttpUtility.UrlEncode("Hello There.docx");
"Hello+There.docx"
The + symbol is only valid for URL parameters, not document names. Interestingly enough, this actually works on the Visual Studio test web server but not on IIS.
The UrlPathEncode function works fine for spaces:
HttpUtility.UrlPathEncode("Hello There.docx");
"Hello%20There.docx"
However, it will not escape other characters such as the # character:
HttpUtility.UrlPathEncode("Hello#There.docx");
"Hello#There.docx"
This link is invalid as the # is interpreted as a URL hash and never even gets to the server.
Is there a .NET utility method to escape all non-alphanumeric characters in a document name, or would I have to write my own?

Have a look at the Uri.EscapeDataString Method:
Uri.EscapeDataString("Hello There.docx") // "Hello%20There.docx"
Uri.EscapeDataString("Hello#There.docx") // "Hello%23There.docx"
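If a link is ever assembled in client-side script instead, JavaScript's built-in `encodeURIComponent` gives the same results for these two names. A small sketch; the `/documents/` prefix and the `documentLink` helper are invented for the example.

```javascript
// Hypothetical helper: builds a link to an uploaded document by escaping the
// raw name as a single path segment. The "/documents/" prefix is an assumption.
function documentLink(rawName) {
  return "/documents/" + encodeURIComponent(rawName);
}

console.log(documentLink("Hello There.docx")); // "/documents/Hello%20There.docx"
console.log(documentLink("Hello#There.docx")); // "/documents/Hello%23There.docx"
```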

I would approach it a different way: Do not use the document name as key in your look-up - use a Guid or some other id parameter that you can map to the document name on disk in your database. Not only would that guarantee uniqueness but you also would not have this problem of escaping in the first place.

You can use the @ character to create verbatim strings, in which backslashes are not treated as escape characters. See the pieces of code below.
string str = @"\n\n\n\n";
Console.WriteLine(str);
Output: \n\n\n\n
string str1 = @"\df\%%^\^\)\t%%";
Console.WriteLine(str1);
Output: \df\%%^\^\)\t%%
This kind of formatting is very useful for pathnames and for creating regexes.

Html encode problem for server side strings

I am trying to HTML-encode the string below, which has quotes, but it's not working.
The server returns the string with quotes:
string serverString = "“Test hello,”"; // this is returned from the database
serverString = HttpUtility.HtmlEncode(serverString);
I am getting this result:
�Test helloI,�
The quotes are still not being replaced, and I am getting some diamond symbols on the ASP.NET page.
Can anybody tell me what I am doing wrong?

The quote characters you're seeing are perfectly legitimate characters from an HTML standpoint, so they don't need to be encoded by HtmlEncode. What you're most likely seeing is an issue with your browser's encoding not supporting those characters. See http://www.htmlbasictutor.ca/character-encoding.htm for more information.
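One way the diamond symbol can arise, sketched in JavaScript: bytes written in one encoding are read back in another. This assumes the text was stored as Windows-1252 and decoded as UTF-8, which may or may not be what happened in the asker's setup.

```javascript
// In Windows-1252 the curly double quotes are the single bytes 0x93 and 0x94.
// On their own those bytes are not valid UTF-8.
const win1252Bytes = Uint8Array.of(0x93, 0x54, 0x65, 0x73, 0x74, 0x94);

// Decoding them as UTF-8 replaces each invalid byte with U+FFFD, the
// replacement character that browsers draw as a diamond with a question mark.
const misread = new TextDecoder("utf-8").decode(win1252Bytes);
console.log(misread === "\uFFFDTest\uFFFD"); // true

// When both sides agree on UTF-8, the curly quotes survive the round trip.
const utf8Bytes = new TextEncoder().encode("\u201CTest\u201D");
const roundTrip = new TextDecoder("utf-8").decode(utf8Bytes);
console.log(roundTrip === "\u201CTest\u201D"); // true
```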

Are you sure it's not a rendering issue? You might try a font like "Arial Unicode MS" to make sure the browser is rendering the characters properly.
You should also verify the string returned from the database is correct.
Lastly, it could help to share how you're writing serverString to your response stream. Some ASP.NET controls expect text and HTML-encode for you while others expect HTML and do not.

This is because the server is returning fancy double quotes (that's not the technical name for them) instead of regular double quotes. You could do something like this:
string serverString = "“Test hello,”";
serverString = HttpUtility.HtmlEncode(serverString)
// Replaces the fancy left double quote (U+201C) with a regular one
.Replace("\u201C", "\"")
// Replaces the fancy right double quote (U+201D) with a regular one
.Replace("\u201D", "\"");

PHP's htmlspecialchars equivalent in .NET?

PHP has a great function called htmlspecialchars(): you pass it a string and it replaces all of HTML's special characters with their safe equivalents. It's almost a one-stop shop for sanitizing input. Very nice, right?
Well, is there an equivalent in any of the .NET libraries?
If not, can anyone link to any code samples or libraries that do this well?

Try this.
var encodedHtml = HttpContext.Current.Server.HtmlEncode(...);

System.Web.HttpUtility.HtmlEncode(string)

I don't know if there's an exact replacement, but there is a method, HttpUtility.HtmlEncode, that replaces special characters with their HTML equivalents. A close cousin is HttpUtility.UrlEncode, for rendering URLs. You could also use validator controls like RegularExpressionValidator and RangeValidator, or System.Text.RegularExpressions.Regex, to make sure you're getting what you want.

Actually, you might want to try this method:
HttpUtility.HtmlAttributeEncode()
Why? Citing the HtmlAttributeEncode page at MSDN docs:
The HtmlAttributeEncode method converts only quotation marks ("), ampersands (&), and left angle brackets (<) to equivalent character entities. It is considerably faster than the HtmlEncode method.
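To make concrete what this kind of encoding does, here is a small JavaScript sketch of the same idea. `escapeHtml` and `HTML_ENTITIES` are made-up names, not part of .NET or PHP, and the set of five characters is an assumption chosen for illustration; the real methods differ in exactly which characters they convert.

```javascript
// Hypothetical lookup table: each special character and its entity.
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: replaces every special character in one pass, so an
// ampersand produced by one replacement is never encoded a second time.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]);
}

console.log(escapeHtml('<a href="x">Tom & Jerry\'s</a>'));
// &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#39;s&lt;/a&gt;
```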

In addition to the given answers:
When using the Razor view engine (which is the default view engine in ASP.NET MVC), using the '@' character to display values will automatically encode the displayed value. This means that you don't have to encode it yourself.
On the other hand, when you don't want the text to be encoded, you have to specify that explicitly (by using @Html.Raw), which is, in my opinion, a good thing from a security point of view.
