Optimistic RegEx Matching for User Text Entry

I'm working on a text entry application that uses regular expressions to validate user input. The goal is to allow keypresses that fit a certain RegEx while rejecting invalid characters. One issue I've run into is that when a user starts inputting information they may create a string that doesn't yet match the given regex, but could cause a match in the future. These strings get erroneously rejected. Here's an example - given the following regex for inputting date information:
(0?[1-9]|10|11|12)/(0?[1-9]|[12]\\d|30|31)/\\d{2}\\d{2}
A user may begin entering "1/" which could be a valid date, but RegEx.IsMatch() will return false and my code ends up rejecting the string. Is there a way to "optimistically" test strings against a regular expression so that possible or partial matches are allowed?
Bonus: For this RegEx in particular there are some sequences which cause required characters. For example, if the user types "2/15" the only possible valid character they could enter next is "/". Is it possible to detect those scenarios so that the required characters could be automatically entered for the user to ease input?

What you can do is anchor your RegExp (i.e. add ^ and $, for start/end of line) and make the later components optional for validation, but strictly defined if present.
Something looking like this:
^(0?[1-9]|10|11|12)(/((0?[1-9]|[12]\\d|30|31)(/(\\d{2}(\\d{2})?)?)?)?)?$
I do realize it looks horrible but as far as I know there is no way to tell the regexp engine to validate as long as the string satisfies the beginning of the regexp pattern.
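As a rough illustration of how the nested-optional pattern behaves while the user types, here is a minimal sketch. The class and method names (PartialDateDemo, IsPartial) are made up for this example, and the pattern is written as a verbatim string, so the backslashes are single.

```csharp
using System;
using System.Text.RegularExpressions;

static class PartialDateDemo
{
    // Every component after the month sits inside an optional group,
    // so a string that stops at a component boundary still matches.
    static readonly Regex Partial = new Regex(
        @"^(0?[1-9]|10|11|12)(/((0?[1-9]|[12]\d|30|31)(/(\d{2}(\d{2})?)?)?)?)?$");

    // Hypothetical helper: true if the text typed so far is accepted by the pattern.
    public static bool IsPartial(string typedSoFar) => Partial.IsMatch(typedSoFar);

    static void Main()
    {
        foreach (var s in new[] { "1", "1/", "2/15", "2/15/", "2/15/2010", "13", "1//" })
            Console.WriteLine($"{s,-10} {IsPartial(s)}");
    }
}
```

With this pattern "1/", "2/15" and "2/15/" are accepted, while "13" and "1//" are rejected. It is not a true prefix matcher, though: a lone "0" or a three-digit year such as "2/15/201" is still rejected, so those cases would need extra optional groups.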
In my opinion, the best way to achieve what you want to do is to create separate inputs for day/month/year and check their value when leaving the text field.
It also provides a better visibility and user-experience, as I believe no one likes to be prevented from typing certain characters into a text field with or without noticing them disappear as they type or having slashes inserted automatically and without notice.

Have you ever used an app or form that worked that way, simply refusing to accept any keypress it didn't like? If the answer is Yes, did it blow an electronic raspberry each time you pressed a wrong key?
If you really need to validate the input before the form is submitted, use a passive feedback mechanism like a red border around the textfield that disappears when the regex matches the input. Also, make sure there's a Help button or a tooltip nearby to provide constructive feedback.
Of course, the best option would be to use a dedicated control like a date-entry widget. But whatever you do, don't do it in such a way that it feels like you're playing guessing games with the user.

Related

Converting "bad" characters to their equivalent without a direct string.Replace and a list

I have done my research and everything I've found either does nothing or is too Leeroy Jenkins and replaces everything else that it shouldn't. It's possible that I'm phrasing everything wrong in my search and so coming up with nothing.
I have to replace all the wrong characters that rich text programs (and older programs) autocorrect for the user because the user then copy/pastes directly into a web form.
For example, the "funky" apostrophe (’) converted to the regular apostrophe (') and the quotation marks and everything else.
I've tried UTF en/decoding, diacritic removal (not at all what I need), and a direct brute force string.Replace isn't reasonable, really.
Here's some example text that has all the bad stuff:
"They’re taking the hobbits to Isengaurd with bad apostrophe’s instead of good one's. Itâ€™s just how they roll."
Note that the only good apostrophe is in one's, and that I already have one rendered result of this (Itâ€™s), so I need to convert it back (along with all the other baddies) without a string.Replace and a list of characters to watch for.
What ought I be doing here?
To clarify: I need to convert the bad characters to good equivalents before data is submitted AND I need to catch existing stuff that was rendered after it was saved. So I need to do two things here.

Displaying phone format XAML/c#

In my Windows Phone project, I would like the user to enter his phone number in xxx-xxx-xxxx format. The country code is not required. I tried to implement a regex, but I am not getting it right. I just want it to be displayed to the user as he enters it, nothing more, nothing less. This is what I have used:
^\(\d{3}\) ?\d{3}( |-)?\d{4}$
But no matter what I put in, I always get this error (in this case 5): "Unrecognized escape sequence". I noticed that this only refers to the backslash. When I add a second "\" after it, the error goes away, but I do not get what I want. Is there a special way to input numbers in the textbox in that manner, at the XAML level?
Thanks in advance!

Put your regex inside a verbatim string, and also put the space and hyphen inside a group and make it optional.
@"^\(\d{3}\)([- ]?)\d{3}\1\d{4}$"

For testing your RegEx you can use this site: http://www.regexlib.com/RETester.aspx?AspxAutoDetectCookieSupport=1.
For your xxx-xxx-xxxx format I would use this: ^\d{3}-?\d{3}-?\d{4}$
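To make the verbatim-string point concrete, here is a small sketch. PhoneFormatDemo, IsPhone and Format are hypothetical names, and Format is only one possible way to produce the xxx-xxx-xxxx display form. In a regular string literal "\d" is an unrecognized escape sequence, whereas the @"..." form passes the backslash through to the regex engine.

```csharp
using System;
using System.Text.RegularExpressions;

static class PhoneFormatDemo
{
    // Verbatim string: \d reaches the regex engine untouched.
    static readonly Regex Dashed = new Regex(@"^\d{3}-?\d{3}-?\d{4}$");

    // True for ten digits with optional hyphens after the 3rd and 6th digit.
    public static bool IsPhone(string input) => Dashed.IsMatch(input);

    // Hypothetical display helper: strip non-digits, then re-insert hyphens
    // if exactly ten digits remain; otherwise return the input unchanged.
    public static string Format(string input)
    {
        string digits = Regex.Replace(input, @"\D", "");
        return digits.Length == 10
            ? digits.Substring(0, 3) + "-" + digits.Substring(3, 3) + "-" + digits.Substring(6)
            : input;
    }

    static void Main()
    {
        Console.WriteLine(IsPhone("555-123-4567"));
        Console.WriteLine(IsPhone("(555) 123-4567"));
        Console.WriteLine(Format("5551234567"));
    }
}
```

Validation alone does not change what the user sees. Reformatting on a text-changed or lost-focus event, as Format hints at, is one way to get the displayed format.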

Is XSS possible through the MailAddress class?

Considering I parse user input, which is supposed to be an email address, into the MailAddress class:
var mailString = Request.QueryString["mail"];
var mail = new MailAddress(mailString);
Is there any possibility left for a cross-site-scripting attack if I output the MailAddress object later in any way? For example through a Literal control in WebForms:
litMessage.Text = "Your mail address is " + mail.Address;
Is it necessary to sanitize the output even though I made sure that the address is a valid email address by parsing the string?
From what I could gather the RFC for mail addresses is pretty complicated, so I am unsure if cross site scripts can be hidden in a mail address considered valid by .NET.
EDIT:
MSDN says that > and < brackets are allowed in an email address:
The address parameter can contain a display name and the associated e-mail address if you enclose the address in angle brackets. For example: "Tom Smith <tsmith@contoso.com>"
So the question remains if this is enough for an XSS attack and/or if the MailMessage class does anything to escape dangerous parts.

Generally speaking, you shouldn't need to validate the output later. However, I always recommend that you do so for the following reasons:
There may be a hole somewhere in your app that doesn't validate the input properly. This could be discovered by an attacker and used for XSS. This is especially possible when many different devs are working on the app.
There may be old data in the database that was stored before implementing/updating your filter on the input. This could contain malicious code that could be used for XSS.
Attackers are very clever and can usually figure out a way to beat a filter. Microsoft puts a lot of attention on preventing this, but it's never going to be perfect. It makes the attacker's job that much harder if they face an outgoing filter as well as an incoming filter.
I know it's a pain to constantly filter, but there is a lot of value in doing so. A Defense-in-Depth strategy is necessary in today's world.
Edit:
Sorry I didn't really answer the second part of your question. Based on the documentation I don't get the impression that the API is focused on sanitizing as much as it is on verifying valid formatting. Therefore I don't know that it is safe to rely on it for security purposes.
However, writing your own sanitizer isn't terribly hard, and you can update it immediately if you find flaws. First run the address through a good RegEx filter (see: Regex Email validation), then recursively remove every nonvalid character in an email address (these shouldn't get through at this point but do this for comprehensiveness and in case you want to reuse the class elsewhere), then escape every character with HTML meaning. I emphasize the recursive application of the filter because attackers can take advantage of a non-recursive filter with stuff like this:
<scr<script>ipt>
Notice that a nonrecursive filter would remove the middle occurrence of <script> and leave the outer occurrence intact.
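The difference between a single pass and a repeated pass can be sketched as follows. The names (StripDemo, RemoveOnce, RemoveUntilStable) are invented for this illustration, and the filter only looks for the literal, case-sensitive text <script>, so it is a demonstration of the recursion point and not a usable sanitizer.

```csharp
using System;

static class StripDemo
{
    // Single pass: removes the occurrences present in the original string only.
    public static string RemoveOnce(string input) => input.Replace("<script>", "");

    // Repeated pass: keeps removing until a pass changes nothing.
    public static string RemoveUntilStable(string input)
    {
        string previous;
        do
        {
            previous = input;
            input = input.Replace("<script>", "");
        } while (input != previous);
        return input;
    }

    static void Main()
    {
        string attack = "<scr<script>ipt>";
        Console.WriteLine(RemoveOnce(attack));        // the outer tag is reassembled
        Console.WriteLine(RemoveUntilStable(attack)); // nothing is left
    }
}
```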

Is it necessary to sanitize the output
You don't 'sanitise' output, you encode it. Every string that you output into an HTML document needs to be HTML-encoded, so if there was a < character in the mail address it wouldn't matter - you'd get &lt; in the HTML source as a result and that would display correctly as a literal < on the page.
Many ASP.NET controls automatically take care of HTML-escaping for you, but Literal does not by default because it can be used to show markup. But if you set the Mode property of the Literal control to Encode then setting the Text like you're doing is perfectly fine.
You should make sure you always use safe HTML-encoded output every time you put content into an HTML page, regardless of whether you think the values you're using will ever be able to include a < character. This is a separation-of-concerns issue: HTML output code knows all about HTML formatting, but it shouldn't know anything about what characters are OK in an e-mail address or other application field.
Leaving out an escape because you think the value is 'safe' introduces an implicit and fragile coupling between the output stage and the input stage, making it difficult to verify that the code is safe and easy to make it unsafe when you make changes.
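As a sketch of encoding at the output stage, the snippet below uses WebUtility.HtmlEncode from System.Net. EncodeOnOutputDemo and BuildMessage are hypothetical names. In the WebForms case from the question, setting the Literal's Mode to Encode achieves the same effect without a manual call.

```csharp
using System;
using System.Net;
using System.Net.Mail;

static class EncodeOnOutputDemo
{
    // Hypothetical helper: parse the input, then HTML-encode whatever is
    // written into the page, regardless of what the parser accepted.
    public static string BuildMessage(string mailString)
    {
        var mail = new MailAddress(mailString); // throws FormatException if it cannot be parsed
        return "Your mail address is " + WebUtility.HtmlEncode(mail.ToString());
    }

    static void Main()
    {
        // The display-name form from the MSDN quote contains angle brackets.
        Console.WriteLine(BuildMessage("Tom Smith <tsmith@contoso.com>"));
    }
}
```

The encoded result contains &lt; and &gt; in place of the angle brackets, so the browser shows them as literal characters and does not treat them as markup.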

Character / Language

I just developed a simple asp.net mvc application project for English only. I want to block user's any input for a language other than English. Is it possible to know whether user inputs other languages when they write something on textbox or editor in order to give a popup message?

You could limit the input box to Latin characters, but there's no automatic way to see if the user entered something in, say, English, Finnish or Norwegian. They all mostly use a-z. Any character outside of a-z could give you an indication, but certain accents need to be allowed in English as well, so it's not 100%.
Google Translate exposes a javascript API to detect the language of text.

Use the following code:
<p>Note that this community uses the English language exclusively, so please be
considerate and write your posts in English. Thank you!</p>

there are two tests you can do. one is to find out what the cultureinfo is set on the users machine:
http://msdn.microsoft.com/en-us/library/system.threading.thread.currentuiculture.aspx
this will give you their current culture setting, which is a start. of course, you can have your setting as 'english' but still typing in russian, and most of the letters will be the same..
so the next step is to discover the language using this: http://www.google.com/uds/samples/language/detect.html
it's not the greatest, according to online discussions, but it's a place to start. I'm sure there are better natural language identifiers out there, though.

Checking for Latin 26
If you wanted to ensure that no non-English letters are submitted, you could simply check whether any characters fall outside the A-Z, a-z, 0-9 and normal punctuation ranges. It sounds like you want the regular non-Latin characters to be detected and rejected.
Detecting the user's OS settings, keyboard settings isn't the best way, as the user could have multiple keyboards attached, and have use of copy/paste.
UI Validation
At the user interface level, you could create a jQuery method that would check the value of a textbox for a value other than your acceptable range. Perhaps that's A-Z, a-z and numeric. You could do this on event onBlur. Remember that you might want to allow ', .
$('#customerName').blur(function() {
    // true when the value contains only a-z, A-Z and 0-9;
    // extend the character class to allow ' . and so on
    var isAlphaNumeric = /^[a-zA-Z0-9]*$/.test($(this).val());
    alert(isAlphaNumeric);
});
Controller Validation
If you wanted to ALSO implement this at the controller level, you could run a regex on the incoming values.
public ActionResult CreateCustomer(string custName)
{
    if (IsAcceptableRange(custName))
    {
        //continue
    }
    return View();
}
public bool IsAcceptableRange(string input)
{
    //whitelist all the valid inputs here. be sure to include
    //space, period, apostrophe, hyphen, etc
    //(requires: using System.Text.RegularExpressions;)
    Regex alphaNumericPattern = new Regex("[^a-zA-Z0-9]");
    //the pattern matches any character outside the whitelist
    return !alphaNumericPattern.IsMatch(input);
}

Google Translate was quoted in two answers, but I want to add that Microsoft Word API may also be used to detect language, just like Word does for check spelling.
It is for sure not the best solution, since language detection by Microsoft Office doesn't work very well (IMHO), but may be an alternative if doing web requests to Google or other remote service on every posted message is not a solution.
Also, check spelling through Microsoft Word API can be useful too. If a message has a huge number of misspelled words when checking in English, it's probably because the message is written in another language (or the author of the message writes too badly, too).
Finally, I completely agree with Matti Virkkunen. The best, and maybe the only way to ensure that messages will be written in English is to ask the users to write in English. Otherwise, it's just as bad as implementing obscenity filters.

Capturing Keyboard strokes in C#

Hi,
I have the following problem- the following text is in a rich text box .
The world is [[wonderful]] today .
If the user puts two brackets before and after a word, as in the case of wonderful, the word in brackets (here, wonderful) shall change to a link (with a green colour).
I am having problems getting the sequence of the keystrokes, i.e. how do I know that the user has entered [[, so I can start parsing the rest of the text which follows it?
I can get it by handling the KeyDown event and keeping a list, but that does not look elegant at all.
Please let me know what should be a proper way.
Thanks,
Sujay

You have two approaches that I can think of off-hand.
One is, as you suggest, maintain the current state with a list—was this key a bracket? was the last key a bracket?—and update on the fly.
The other approach would be to simply handle the TextChanged event and re-scan the text for the [[text-here]] pattern and update as appropriate.
The first requires more bookkeeping but will be much faster for longer text. The second approach is easier and can probably be done with a decent regex, but it will get slower as your text gets longer. If you know you have some upper limit, like 256 characters, then you're probably fine. But if you're expecting novels, probably not such a great idea.
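The second approach (re-scanning on TextChanged) could look roughly like the sketch below. WikiLinkScanner and Scan are names invented here, and the pattern is one reasonable guess at "a decent regex" for [[text-here]]. The scanner only reports where the links are. Applying the green colour to those ranges in the RichTextBox is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class WikiLinkScanner
{
    // Two opening brackets, one or more non-bracket characters, two closing brackets.
    static readonly Regex Link = new Regex(@"\[\[([^\[\]]+)\]\]");

    // Returns the position, length and inner word of every [[...]] span in the text.
    public static List<(int Start, int Length, string Word)> Scan(string text)
    {
        var result = new List<(int Start, int Length, string Word)>();
        foreach (Match m in Link.Matches(text))
            result.Add((m.Index, m.Length, m.Groups[1].Value));
        return result;
    }

    static void Main()
    {
        foreach (var link in Scan("The world is [[wonderful]] today ."))
            Console.WriteLine($"{link.Word} at {link.Start}, length {link.Length}");
    }
}
```

Because the whole text is re-scanned on every change, pasted text and edits in the middle of the document are handled without any keystroke bookkeeping, at the cost described above for long documents.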

I would recommend Google'ing: "richtextbox syntax highlighter", there are so many people that have done this, and there is a lot behind the scenes to make it work.
I dare say that every single simple solution has major drawbacks. The proper way would be to use a control that already does this kind of "syntax highlighting" and extend it to your syntax. It is also most likely the easiest way.
You can search for free .NET controls on Codeplex.

I would try handling the KeyDown event and checking for the closing bracket "]" instead. Once you receive one, you could check the last character in your text box for the second ], and if it's there, just replace the last few characters.
This eliminates the need for maintaining state (ie: the list). As soon as the second ] was typed, the block would change to a link instantly.

Keeping a list will be rather complex I think. What if the user types a '[' character, clicks somewhere else in the text and then types a '[' character again. The user has then typed two consecutive '[' characters but in completely different parts of the text. Also, you may want to be able to handle text inserted from the clipboard as well.
I think the safest way is to analyze the full text and do what should be done from that context, using RegEx or some other technique.

(Sorry, don't have enough reputation to add comments yet, so have to add a new answer). As suggested by jeffamaphone I'd handle the TextChanged event and rescan the text each time - but to keep the cost constant, just scan a few characters ahead of the current cursor position instead of reading the entire text.
Trying to intercept the keystrokes and maintain an internal state is a bad approach - it is very easy for your idea of what has happened to get out of sync with the control you are monitoring and cause weird problems. (And how do you handle clicks? Alt-tab? Pastes? Arrow keys? Other applications grabbing focus? Too many special cases to worry about...)
