How do I validate for language? ASP.NET - c#

In a textbox in the application, I need to validate that a user enters only English-language text. I know some languages, such as Spanish, share most of the English alphabet. How do I validate text to make sure it is:
Only in the English language
Only in a language that uses the English character set (Spanish etc.)
Thanks
EDIT: Sorry for not being clear enough. This app is in production, and when I check the SQL database where the text is stored, there are a lot of rows containing "??? ?????". On further investigation, it appears that this happens when non-English text is saved to the database. As an example, go to Google News, select Google Korea from the dropdown, copy some Korean text, and save it to a SQL Server database.
Anyone?

By "English character set", I guess you are referring to the ASCII character set.
You can iterate through each character and see whether it lies in the ASCII range.
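A minimal sketch of that check, assuming "ASCII range" means code points 0 through 127. The class and method names (AsciiValidator, IsAscii) are made up for illustration.

```csharp
using System.Linq;

public static class AsciiValidator
{
    // Returns true when every character is in the 7-bit ASCII range (0..127).
    // Hypothetical helper for illustration; null is treated as invalid.
    public static bool IsAscii(string text)
    {
        return text != null && text.All(c => c <= 127);
    }
}
```

Keep in mind that this also rejects accented letters such as the "ñ" in "señor", so it is stricter than "languages that use the English alphabet".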

You can try checking the text against an English dictionary (OpenOffice, for example, has a dictionary that you may be able to use for free, though I'm not sure about that) to see whether most of the words used are recognized.
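As a rough sketch of the dictionary idea, assuming you have already loaded a word list into a set: the helper below (DictionaryCheck.RecognizedFraction, a made-up name) returns the fraction of words that are found in it. Where the threshold for "mostly English" lies is up to you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DictionaryCheck
{
    // Fraction of the words in the text that appear in the supplied word list.
    // Hypothetical helper; the dictionary itself must come from elsewhere.
    public static double RecognizedFraction(string text, ISet<string> dictionary)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };
        string[] words = text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0) return 0.0;

        int hits = words.Count(w => dictionary.Contains(w));
        return (double)hits / words.Length;
    }
}
```

Build the set with StringComparer.OrdinalIgnoreCase (for example new HashSet<string>(StringComparer.OrdinalIgnoreCase)) so that "The" and "the" both match.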
You could also do some kind of text analysis and check the occurrence of each character or short sequence, such as 'th'. Each language has characteristic character frequencies, and this could help you determine what language the text is written in.
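To give an idea of what such an analysis could look like, here is a very rough sketch that measures how many adjacent letter pairs are common English pairs. The pair list is a small illustrative sample that I am assuming, not a tuned model, and the names (BigramScore, Score) are made up.

```csharp
using System.Collections.Generic;

public static class BigramScore
{
    // A few frequent English letter pairs. Illustrative only, not a tuned model.
    private static readonly HashSet<string> CommonEnglishBigrams =
        new HashSet<string> { "th", "he", "in", "er", "an", "re", "on", "at", "en", "nd" };

    // Fraction of adjacent letter pairs (within words) that are in the list above.
    public static double Score(string text)
    {
        string lower = (text ?? "").ToLowerInvariant();
        int total = 0, hits = 0;
        for (int i = 0; i + 1 < lower.Length; i++)
        {
            // Skip pairs that include spaces, digits, or punctuation.
            if (!char.IsLetter(lower[i]) || !char.IsLetter(lower[i + 1])) continue;
            total++;
            if (CommonEnglishBigrams.Contains(lower.Substring(i, 2))) hits++;
        }
        return total == 0 ? 0.0 : (double)hits / total;
    }
}
```

A real implementation would compare full frequency tables for several languages; a score like this is only meaningful on longer texts.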
I would not prohibit certain characters, because special characters occur quite often, at least in names.
I hope this gives you an idea of some of the possibilities.
Best Regards, Oliver Hanappi

If this is for a moderately small amount of text, you could try finding an English dictionary web service and try to look up the words. If lookup fails, you most likely either have a typo or something from another language. I haven't found one that accepts large blocks of text, but there is a web service that operates off of the dict.org database:
DictService

One way is to use an English dictionary / spell checker to test whether each word is a valid English (or Spanish) word.
A very good sample is this:
NetSpell Sample - Spell Checker for .NET
It is as simple as follows:
// create the spell checker, give it the textbox content, and run the check
NetSpell.SpellChecker.Spelling SpellChecker = new NetSpell.SpellChecker.Spelling();
SpellChecker.Text = MyTextBox.Text;
SpellChecker.SpellCheck();
NetSpell Home Page: http://www.loresoft.com/NetSpell

Related

WPF TextBox - German letter ß automatically replaced with ü

We added German localization to our WPF application, and later we got feedback from one of our German users. He told us that he was unable to input the German letter "ß": it was automatically replaced with the letter "ü".
Looking forward to hearing some answers or suggestions.
I finally found the problem. Our application compares and maps System.Windows.Input.Key against the native ScanCodeShort to fix hotkey functionality across different cultures. As a result, the German symbol 'ß' is detected as the 'OEM4' key, and as a ScanCodeShort it is equal to 'OemOpenBrackets'. That's why the symbol 'ü' is entered instead of 'ß'. Now I'm trying to find a solution for this problem, but that seems to be another story.

Detect whether the text in PDF or DOC files is English

The requirement is that I want to identify whether the text in a PDF or DOC file is English or non-English. If I find even a single word in another language (Turkish, French, Arabic, etc.), I have to reject the whole document.
It's urgent; please give me sample code for this functionality.
Have a look at the Google Translate API; as far as I know, it is the only free service that can do this for you. Otherwise, the only solution I can see is maintaining your own dictionary, but that's a different story.
I guess you could use LangId. However there are some restrictions:
To use our API in live websites or services we suggest you to apply for a free API key, using the below form. The API key expands your developing possibilities allowing you to do up till 1,000 requests per hour (~720,000 per month).
I don't think this will solve your 'single word' issue, however. I believe that if the text has 6 words in English and 4 words in another language, it will classify the text as English, since that is the language mainly used in the file. I haven't looked at the API myself, though, so there might be some solution for that.
Hope it is of use to you.
Maybe the detect function of Google's Translate API could help you:
http://code.google.com/apis/language/translate/v2/getting_started.html#language_detect
This is not possible for single words.
Is "the" an English word? Well, yes, but it's also a Danish word (meaning tea). Does the word Schadenfreude indicate a non-English text? Not necessarily; it all depends on the context.
Adding to the list of APIs that support language determination, Bing API has a call that will determine the language for an array of strings.
http://msdn.microsoft.com/en-us/library/ff512412.aspx
Hope this helps somewhat.

Phonetic characters to speech

My goal is to let my application speak in less widely used languages (for example Hokkien, Malay, etc.). My current approach is to use recorded MP3s.
I want to know whether a 'phonetic characters to speech' engine exists for .NET or any other platform.
By phonetic characters I mean something like the phonetic entries in a paper dictionary. Any ideas?
What you need is a large-vocabulary TTS engine. Microsoft has a Speech SDK that lets you speak text as you type, among other things, and there is also the Windows SAPI (Speech API; I'm not sure whether the SDK and the API are the same thing). I know that they have male and female voices for English, but maybe not for other languages such as Malay (where there may not have been much of a market yet). You might want to take a look at the Festival project (developed at the University of Edinburgh; CMU hosts the related Festvox project). They usually have a lot of voices in different languages, but some of the lesser-known ones may not be as well developed as the ones for English.
Further update:
Check out the MBROLA site. It is an open-source project for developing multilingual large-vocabulary TTS engines, and they also have a Malay extension. I do not know how good it is, though. I tried the Hindi one and feel that a lot of work still needs to be done.
Also, check out the BabelFish site. They have links to a lot of free TTS engines that should have some support for Malay.
Update 3: I do not know if this will suit your purpose, but if the amount of text that the application must speak is small, you can also try concatenative speech synthesis over a limited vocabulary. Record fragments of sentences in Malay (or any other language) and pass the output of your program to your own limited-vocabulary TTS engine, which assembles the output from those fragments. One example (in English): "Player X was the most valuable player." Here, "was the most valuable player" becomes one fragment, while "Player X" can be changed at will. If this serves your purpose, it should work well.
Have you looked at the System.Speech namespaces?
In particular the System.Speech.Synthesis and System.Speech.Synthesis.TtsEngine namespaces.
Here is the VB.NET code:
'create the object. This object will store your phonetic 'characters'
Dim PBuilder As New System.Speech.Synthesis.PromptBuilder
'add your phonetic 'characters' here. Just ignore the first parameter.
'The second parameter is your phonetic 'characters'
PBuilder.AppendTextWithPronunciation("test", "riːdɪŋ")
'now create a speaker to speak your phonetic 'characters'
Dim SpeechSynthesizer2 As New System.Speech.Synthesis.SpeechSynthesizer
'now actually speaking. It will speak 'reading'
SpeechSynthesizer2.Speak(PBuilder)
And here is the converted C# code:
//create the object. This object will store your phonetic 'characters'
System.Speech.Synthesis.PromptBuilder PBuilder = new System.Speech.Synthesis.PromptBuilder();
//add your phonetic 'characters' here. Just ignore the first parameter.
//The second parameter is your phonetic 'characters'
PBuilder.AppendTextWithPronunciation("test", "riːdɪŋ");
//now create a speaker to speak your phonetic 'characters'
System.Speech.Synthesis.SpeechSynthesizer SpeechSynthesizer2 = new System.Speech.Synthesis.SpeechSynthesizer();
//now actually speaking. It will speak 'reading'
SpeechSynthesizer2.Speak(PBuilder);
The .Net System.Speech.Synthesis.PromptBuilder class will create audio from SSML strings. You can use these to construct sounds from raw phonemes and sampled audio. The audio is not language-dependent.
Maybe this? System.Speech.Recognition.SrgsGrammar.SrgsPhoneticAlphabet
I have tried System.Speech.Synthesis.PromptBuilder, and I have to say that the current implementation of phonetic characters is very elementary and not accurate. For example, PromptBuilder lacks speech intonation and stress emphasis within a word. It can only output a monotone, robotic sound, which is very annoying.
My recommendation is to keep using your current approach. Using MP3s to deliver the message sounds more natural and is more cost-effective, given the time required to transcribe your speech into perfect phonetic characters.

How do I let the user input text in German / French for a website

Is there a way to let the end user type text in German / French in a text box on a C# ASP.NET website? Is there an available solution for this?
Thank You
You really need to read this:
http://www.joelonsoftware.com/articles/Unicode.html
Are you talking about transliterating from German/French into English? If so, that doesn't make much sense. Transliteration is used to convert one system of text/writing into another, like Greek to English or German to Chinese. With French, German, and English, there is no need to transliterate because they all use the same alphabet/writing system.
Was your question more about how to add a transliteration feature a la Google and you just used French and German as an example?

Phone number normalization: Any pre-existing libraries?

I have a system which is using phone numbers as unique identifiers. For this reason, I want to format all phone numbers as they come in using a normalized format. Because I have no control over my source data, I need to parse out these numbers myself and format them before adding them to my DB.
I'm about to write a parser that can read phone numbers in and output a normalized phone format, but before I do I was wondering if anyone knew of any pre-existing libraries I could use to format phone numbers.
If there are no pre-existing libraries out there, what things should I be keeping in mind when creating this feature that may not be obvious?
Although my system is only dealing with US numbers right now, I plan to try to include support for international numbers just in case since there is a chance it will be needed.
Edit: I forgot to mention I'm using C# with .NET 2.0.
You could use libphonenumber from Google. Here's a blog post:
http://blog.appharbor.com/2012/02/03/net-phone-number-validation-with-google-libphonenumber
Parsing numbers is as easy as installing the NuGet package and then doing this:
var util = PhoneNumberUtil.GetInstance();
var number = util.Parse("555-555-5555", "US");
You can then format the number like this:
util.Format(number, PhoneNumberFormat.E164);
libphonenumber supports several formats other than E.164.
I'm currently involved in the OpenMoko project, which is developing a completely open source cell phone (including hardware). There has been a lot of trouble around normalizing phone numbers. I don't know if anyone has come up with a good solution yet. The biggest problem seems to be with US phone numbers, since sometimes they come in with a 1 on the front and sometimes not. Depending on what you have stored in your contacts list, it may or may not display the caller ID info correctly. I'd recommend stripping off the 1 on the phone number (though I'd expect most people wouldn't enter it in the first place). You may also need to look for a plus sign or country code on the front of international numbers.
You can check around the OpenMoko website, mailing list, and source control to see if they've solved this bug yet.
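For what it's worth, the "strip the leading 1" idea could look roughly like this. This is only a sketch under my own assumptions: US numbers only, a result of exactly 10 digits, and made-up names (UsPhone, Normalize).

```csharp
using System.Text.RegularExpressions;

public static class UsPhone
{
    // Returns the 10-digit national number, or null if the input
    // does not look like a US number. Hypothetical helper for illustration.
    public static string Normalize(string raw)
    {
        // Keep ASCII digits only.
        string digits = Regex.Replace(raw ?? "", "[^0-9]", "");

        // Drop the leading country code 1 when present.
        if (digits.Length == 11 && digits[0] == '1')
        {
            digits = digits.Substring(1);
        }
        return digits.Length == 10 ? digits : null;
    }
}
```

With this, "1 (555) 555-5555" and "555-555-5555" both come out as "5555555555", so they compare equal as identifiers.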
Perl and Rails examples:
http://validates-as-phone.googlecode.com/svn/trunk/README
http://www.perlmonks.org/?node_id=159645
Just strip out any non-digits, possibly using a RegEx: [^\d]
The only exception might be if you want to handle extensions, to distinguish a number without an area code but with a 3 digit extension, or if you need to handle international numbers.
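In C#, that regex could be applied roughly as follows (the class and method names are just placeholders):

```csharp
using System.Text.RegularExpressions;

public static class DigitsOnly
{
    // Removes every character that is not a digit.
    // Note: in .NET, \d also matches non-ASCII Unicode digits by default.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"[^\d]", "");
    }
}
```

Note that an input like "(555) 555-5555 x123" becomes "5555555555123", which is exactly the extension problem mentioned above.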
What you need is a list of all country codes. First match the first few characters of your string against that list to make sure the country code is valid; then, for the rest of the number, make sure it is all digits and of a proper length, which usually varies from 5 to 10 digits.
To check against country codes, install the NGeoNames NuGet package, which uses the website www.geonames.org to get a list of all country codes to match against.
