Character / Language

Character / Language - c#

I just developed a simple asp.net mvc application project for English only. I want to block user's any input for a language other than English. Is it possible to know whether user inputs other languages when they write something on textbox or editor in order to give a popup message?

You could limit the input box to latin characters, but there's no automatic way to see if the user entered something in say English, Finnish or Norwegian. They all mostly use a-z. Any character outside of a-z could give you an indication, but certain accents needs to be allowed in English as well, so it's not 100%.
Google Translate exposes a javascript API to detect the language of text.

Use the following code:
<p>Note that this community uses the English language exclusively, so please be
considerate and write your posts in English. Thank you!</p>

there are two tests you can do. one is to find out what the cultureinfo is set on the users machine:
http://msdn.microsoft.com/en-us/library/system.threading.thread.currentuiculture.aspx
this will give you their current culture setting, which is a start. of course, you can have your setting as 'english' but still typing in russian, and most of the letters will be the same..
so the next step is to discover the language using this: http://www.google.com/uds/samples/language/detect.html
it's not the greatest, according to online discussions, but its a place to start. I'm sure there are better natural language identifiers out there, though.

Checking for Latin 26
If you wanted to ensure that any non-English letters were submitted, you could simply validate that they fall outside the A-Z, a-z, 0-9 and normal punctuation ranges. It sounds like you want the regular non-Latin characters to be detected and rejected.
Detecting the user's OS settings, keyboard settings isn't the best way, as the user could have multiple keyboards attached, and have use of copy/paste.
UI Validation
At the user interface level, you could create a jQuery method that would check the value of a textbox for a value other than your acceptable range. Perhaps that's A-Z, a-z and numeric. You could do this on event onBlur. Remember that you might want to allow ', .
$('#customerName').blur(function() {
var isAlphaNumeric;
//implementation of checking a-z, A-Z, 0-9, etc.
alert(isAlphaNumeric);
});
Controller Validation
If you wanted to ALSO implement this at the controller level, you could run a regex on the incoming values.
public ActionMethod CreateCustomer(string custName)
{
if (IsAcceptableRange(custName))
{
//continue
}
}
public bool IsAcceptableRange(string input)
{
//whitelist all the valid inputs here. be sure to include
//space, period, apostrophe, hypen, etc
Regex alphaNumericPattern=new Regex("[^a-zA-Z0-9]");
return !alphaNumericPattern.IsMatch(input);
}

Google Translate was quoted in two answers, but I want to add that Microsoft Word API may also be used to detect language, just like Word does for check spelling.
It is for sure not the best solution, since language detection by Microsoft Office doesn't work very well (IMHO), but may be an alternative if doing web requests to Google or other remote service on every posted message is not a solution.
Also, check spelling through Microsoft Word API can be useful too. If a message has a huge number of misspelled words when checking in English, it's probably because the message is written in another language (or the author of the message writes too badly, too).
Finally, I completely agree with Matti Virkkunen. The best, and maybe the only way to ensure that messages will be written in English is to ask the users to write in English. Otherwise, it's just as bad as implementing obscenity filters.

Related

Is XSS possible through the MailAddress class?

Considering I parse user input, which is supposed to be an email address, into the MailAdress class:
var mailString = Request.QueryString["mail"];
var mail = new MailAddress(mailString);
Is there any possibility left for a cross-site-scripting attack if I output the MailAddress object later in any way? For example through a Literal control in WebForms:
litMessage.Text = "Your mail address is " + mail.Address;
Is it necessary to sanitize the outpout even though I made sure that the address is a valid email address by parsing the string?
From what I could gather the RFC for mail addresses is pretty complicated, so I am unsure if cross site scripts can be hidden in a mail address considered valid by .NET.
EDIT:
MSDN says that > and < brackets are allowed in an email address:
The address parameter can contain a display name and the associated e-mail address if you enclose the address in angle brackets. For example: "Tom Smith <tsmith#contoso.com>"
So the question remains if this is enough for an XSS attack and/or if the MailMessage class does anything to escape dangerous parts.

Generally speaking, you shouldn't need to validate the output later. However, I always recommend that you do so for the following reasons:
There may be a hole somewhere in your app that doesn't validate the input properly. This could be discovered by an attacker and used for XSS. This is especially possible when many different devs are working on the app.
There may be old data in the database that was stored before implementing/updating your filter on the input. This could contain malicious code that could be used for XSS.
Attackers are very clever and can usually figure out a way to beat a filter. Microsoft puts a lot of attention on preventing this, but it's never going to perfect. It makes the attackers job that much harder if they face and outgoing filter as well and as incoming filter.
I know it's a pain to constantly filter, but there is a lot of value in doing so. A Defense-in-Depth strategy is necessary in today's world.
Edit:
Sorry I didn't really answer the second part of your question. Based on the documentation I don't get the impression that the API is focused on sanitizing as much as it is on verifying valid formatting. Therefore I don't know that it is safe to rely on it for security purposes.
However, writing your own sanitizer isn't terribly hard, and you can update it immediately if you find flaws. First run the address through a good RegEx filter (see: Regex Email validation), then recursively remove every nonvalid character in an email address (these shouldn't get through at this point but do this for comprehensiveness and in case you want to reuse the class elsewhere), then escape every character with HTML meaning. I emphasize the recursive application of the filter because attackers can take advantage of a non-recursive filter with stuff like this:
<scr<script>ipt>
Notice that a nonrecursive filter would remove the middle occurence of <script> and leave the outer occurrence in tact.

Is it necessary to sanitize the outpout
You don't 'sanitise' output, you encode it. Every string that you output into an HTML document needs to be HTML-encoded, so if there was a < character in the mail address it wouldn't matter - you'd get < in the HTML source as a result and that would display correctly as a literal < on the page.
Many ASP.NET controls automatically take care of HTML-escaping for you, but Literal does not by default because it can be used to show markup. But if you set the Mode property of the Literal control to Encode then setting the Text like you're doing is perfectly fine.
You should make sure you always use safe HTML-encoded output every time you put content into an HTML page, regardless of whether you think the values you're using will ever be able to include a < character. This is a separation-of-concerns issue: HTML output code knows all about HTML formatting, but it shouldn't know anything about what characters are OK in an e-mail address or other application field.
Leaving out an escape because you think the value is 'safe' introduces an implicit and fragile coupling between the output stage and the input stage, making it difficult to verify that the code is safe and easy to make it unsafe when you make changes.

Use Microsoft Translator api to convert Names of peoples not for translation

I'm calling Microsoft translator for converting arabic names to english
But it translates the names , i just want to convert the names
for example:
أحمد ماهر
need to be
Ahmed Maher
the service is working but it translates the meaning of the names not just the names

I don't know about other languages but for your example
أحمد ماهر = Ahmed Maher
What I know is romanizations doesn't work with it
Why? because of the inflection and the letters "movements" that give the one letter more than one pronunciation. for example
مَح = Mah
مُح = Muh
Same letters but with a different pronunciation.
The solution that I've made is rules to convert the names
You can find the code and the npm package that I've made.
If you got any questions, issues or you tried to understand the rules for the Arabic I will be happy to help.
Please check below to know more about the needed rules to convert an Arabic name:
First Letter Rule
I am checking if the letter was the first letter since if the letter “و” was the first one people tend to write it “W” but if it was inside the word it will be written “O”.
Inner Letter Rule
By this I mean all the letters that are not the first nor the last and it will be changed based letter like the first image “م” will equal “H”
Next Letter Rule
In this rule am checking the letter and the upcoming one. like
if the letter was “م” and the next one was “ع” it should be written ”Mua”
if the letter was “م” and the next one was “ي” it should be written “May”

Special Letter Rule
It looks like the next letter rule but it will include another action like “slice”
ex: in the “First Letter Rule” I said that “و” inside the word will be converted to “O”
.”ا" I will need to delete the “O” and put a “w” then “A” for the ”ا“ But if the next letter was And I believe this rule could be enhanced but it’s important to have.
Last Letter Rule
Without this rule, the name “Sarah” will be converted from Arabic to “Sara”
This rule checks if the last letter is “ه“ ”ة” if it is it will add the “H” to complete the name.

Sounds like you want a character substitution, not a true translator. You could build your own transliterator with String.Replace.
Someone asked a similar question about transliterating Cyrillic to Latin:
How to transliterate Cyrillic to Latin text

If it is on a web site, you can enable the Collaborative Translations Framework functionality in the Widget, and use that to 'override' the translation into a name.
The API also supports this if you are building an app.
Widget documentation is here: http://blogs.msdn.com/b/translation/p/ctf1.aspx
and here: http://blogs.msdn.com/b/translation/p/ctf2.aspx

Word Translation by letter/vowels

I'm looking to translate words by letters/vowels.
I'll try to explain.
I have a an Arabic text with ~300,000 words, my goal is to enable users to search the text using one of 10 languages I'll define. So if some search for Stack overflow in English I'll need to break down to words as S-TA-CK O-VE-R-F-LOW (I need to break it that way to get the Arabic equivalent letters).
Is there something like that already exsiting, or I just need to start from scratch and do a linguistic research???
Thank you for your time.

You need to analyze your words by finding the relative syllables. Take a look at Sphinx-4 Java library, I guess there are some example codes for extracting a word to its syllables based on defined grammar rules.

Optimistic RegEx Matching for User Text Entry

I'm working on a text entry application that uses regular expressions to validate user input. The goal is to allow keypresses that fit a certain RegEx while rejecting invalid characters. One issue I've run into is that when a user starts inputting information they may create a string that doesn't yet match the given regex, but could cause a match in the future. These strings get erroneously rejected. Here's an example - given the following regex for inputting date information:
(0?[1-9]|10|11|12)/(0?[1-9]|[12]\\d|30|31)/\\d{2}\\d{2}
A user may begin entering "1/" which could be a valid date, but RegEx.IsMatch() will return false and my code ends up rejecting the string. Is there a way to "optimistically" test strings against a regular expression so that possible or partial matches are allowed?
Bonus: For this RegEx in particular there are some sequences which cause required characters. For example, if the user types "2/15" the only possible valid character they could enter next is "/". Is it possible to detect those scenarios so that the required characters could be automatically entered for the user to ease input?

What you can do is anchor your RegExp (i.e. adding ^ and $, as in start/end of line) and make some component optionnal for validation, but strictly defined if present.
Something looking like this:
^(0?[1-9]|10|11|12)(/((0?[1-9]|[12]\\d|30|31)(/(\\d{2}(\\d{2})?)?)?)?)?$
I do realize it looks horrible but as far as I know there is no way to tell the regexp engine to validate as long as the string satisfies the beginning of the regexp pattern.
In my opinion, the best way to achieve what you want to do is to create separate inputs for day/month/date and check their value when leaving the text field.
It also provides a better visibility and user-experience, as I believe no one likes to be prevented from typing certain characters into a text field with or without noticing them disappear as they type or having slashes inserted automatically and without notice.

Have you ever used and app or form that worked that way, simply refusing to accept any keypress it didn't like? If the answer is Yes, did it blow an electronic raspberry each time you pressed a wrong key?
If you really need to validate the input before the form is submitted, use a passive feedback mechanism like a red border around the textfield that disappears the regex matches the input. Also, make sure there's a Help button or a tooltip nearby to provide constructive feedback.
Of course, the best option would be to use a dedicated control like a date-entry widget. But whatever you do, don't do it in such a a way that it feels like you're playing guessing games with the user.

CamelCase conversion to friendly name, i.e. Enum constants; Problems?

In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
#"(?<a>(?<!^)[A-Z][a-z])", #" ${a}");
result = Regex.Replace(result,
#"(?<a>[a-z])(?<b>[A-Z0-9])", #"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?

So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, #"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", #" ${a}");
Does it in one pass.

not that this directly answers the question, but why not test by taking the standard C# API and converting each class into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.

Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market it so small that just indexing an array of names would be scalable enough (e.g. a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available to other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.