Speech to text and text to speech for foreign languages - c#

I'm considering porting a speech-based 2D HTML5 web game I've built to Unity 2D for iPhone and Android. I'm a full-stack web developer, not a Unity developer, so an agency would help me build the Unity app. Before signing with them, I need to be sure that both Speech to Text (STT) and Text to Speech (TTS) services are available for Mandarin, Spanish, and English; otherwise I'd waste a lot of money up front.
For the web, WebKit Speech (STT Docs, STT Demo, TTS Docs, TTS Demo) is easily accessible via the browser. I've found that IBM Watson has an API with demos for STT and TTS, and that they have a Unity SDK here, but I don't have the skills to test the Unity SDK.
I'm looking for guidance on great STT and TTS APIs that the agency can use for those three foreign languages.
Does the Unity SDK provide support for frontend STT and TTS audio streaming? STT needs to capture users' voice input and transcribe it quickly. Likewise, TTS needs to allow the user to hover over a target language word and listen to a near-native pronunciation.
Does it offer both STT and TTS for Spanish, Mandarin, and English?
What other NLP APIs are there which meet my requirements?
Apologies, I'm completely new to Unity/phone development, so any guidance here would be extremely helpful. If no APIs exist that meet these requirements, then Unity won't work for my app, since STT and TTS are critical.

Overall, realtime audio recording in Unity is awful; the system is simply not designed to record audio continuously. You can record a clip with the Microphone class (and play it back through an AudioSource), but that is a clip of fixed length, not a streaming solution.
For streaming you can get the audio with OnAudioFilterRead, but that is not really an API for recording; it is meant for effects. When used for recording it has unpredictable latency and also slows down the UI significantly.
As a result, you can only have a push-to-talk kind of interaction, not a realtime one.
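A minimal sketch of the push-to-talk pattern described above, assuming Unity's built-in Microphone API. The component name `PushToTalkRecorder`, the space-bar trigger, the 16 kHz sample rate, and the `OnUtterance` hook are illustrative assumptions, not part of any SDK. On iOS and Android the app also needs microphone permission before recording will work.

```csharp
using UnityEngine;

// Hypothetical component name; attach it to any GameObject in the scene.
public class PushToTalkRecorder : MonoBehaviour
{
    const int MaxSeconds = 10;     // fixed clip length, as noted above
    const int SampleRate = 16000;  // assumed rate; check what the chosen STT service expects

    AudioClip clip;

    void Update()
    {
        // Hold the space bar to talk (a stand-in for an on-screen button).
        if (Input.GetKeyDown(KeyCode.Space))
        {
            // null selects the default microphone; loop = false stops at MaxSeconds.
            clip = Microphone.Start(null, false, MaxSeconds, SampleRate);
        }
        else if (Input.GetKeyUp(KeyCode.Space) && clip != null)
        {
            // If the clip already filled up, recording has stopped and the whole clip is valid.
            // Otherwise read the write position before calling End().
            int recorded = Microphone.IsRecording(null)
                ? Microphone.GetPosition(null)
                : clip.samples;
            Microphone.End(null);

            if (recorded > 0)
            {
                // Copy only the part of the fixed-length clip that was actually recorded.
                float[] samples = new float[recorded * clip.channels];
                clip.GetData(samples, 0);
                OnUtterance(samples, clip.frequency);
            }
            clip = null;
        }
    }

    // Hypothetical hook: hand the samples to whichever STT service is chosen.
    void OnUtterance(float[] samples, int sampleRate)
    {
        Debug.Log($"Captured {samples.Length} samples at {sampleRate} Hz");
    }
}
```

The whole utterance is handed over only after the button is released, which is why this gives push-to-talk behaviour rather than realtime transcription.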
If you have other alternatives, you had better consider them too. For example, you could consider a native app.

Related

Call Functions on Voice Commands In Android Unity

I am making a flashlight application in Unity with C#. The application is almost complete; I just want to add a voice command feature, so that when I say "ON" the flashlight turns on and when I say "OFF" it turns off. The application is for Android devices. I have seen several tutorials about calling functions on voice commands, but they were all for the Windows platform only. Please help if you know how to do this on Android. Thanks.
I have not used any speech recognition tools, but it's not very difficult to implement if you can create a Java plugin and use it to call native functions. Anyway, I have found a few SDKs:
You can check out the PocketSphinx demos for speech recognition.
https://github.com/cmusphinx/pocketsphinx
https://github.com/cmusphinx/pocketsphinx-android-demo
Here is a repo I found which uses AndroidSpeechRecognition.
https://github.com/gsssrao/UnityAndroidSpeechRecognition
Programmer has given a nice explanation of implementing voice recognition natively:
How to add Speech Recognition to Unity project?
Then there is the Watson SDK for Unity; it appears to be cloud-based, but you can check it out:
https://github.com/watson-developer-cloud/unity-sdk
And if you don't mind paying, there is a plugin called Android SpeakNow that you can grab from the Asset Store:
https://assetstore.unity.com/packages/tools/integration/android-speaknow-16781
These are some cloud-based packages from the Asset Store. I doubt you will need them, but they are listed here for anyone who may require them at some point:
https://assetstore.unity.com/packages/add-ons/machinelearning/google-cloud-speech-recognition-vr-ar-desktop-desktop-72625
https://assetstore.unity.com/packages/tools/integration/yandex-cloud-speech-recognition-vr-ar-mobile-desktop-75155
And finally there is DictationRecognizer; as of Unity 2018.2 it is available only on Windows 10 by default, so it is out of the question. My best bet would be CMUSphinx or a native implementation, which I believe would be more suitable for your needs. Check them out, try to implement one or two, and let us know whether you were successful.
If anyone knows of more voice recognition SDKs, feel free to add links. That would be really great.
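Whichever recognizer ends up being used, the step of turning recognized text into function calls can be kept separate from it. The following is a minimal sketch under that assumption; `VoiceCommandRouter` and its methods are hypothetical names and are not part of any of the SDKs listed above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps recognized phrases to actions, independent of the recognizer.
public class VoiceCommandRouter
{
    // Case-insensitive lookup so "ON", "On" and "on" all match the same command.
    readonly Dictionary<string, Action> commands =
        new Dictionary<string, Action>(StringComparer.OrdinalIgnoreCase);

    // Associates a spoken phrase with the function to call.
    public void Register(string phrase, Action action) => commands[phrase.Trim()] = action;

    // Returns true if the recognized text matched a registered command and it was run.
    public bool Handle(string recognizedText)
    {
        if (string.IsNullOrWhiteSpace(recognizedText)) return false;
        if (!commands.TryGetValue(recognizedText.Trim(), out var action)) return false;
        action();
        return true;
    }
}
```

For the flashlight case, one would register "on" and "off" with the functions that switch the light, then pass each recognition result from the plugin to `Handle`. Exact matching is a deliberate simplification; a real recognizer may return longer phrases that need to be searched for the keyword.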
If you only need ON and OFF voice inputs, you can use the following code:
Speech to text in unity
If you need exact speech recognition, then refer to the following code:
Speech recognition in unity

Perform real time continuous speech recognition using Xamarin and Microsoft Speech Service API

I saw in the documentation of the Bing Speech API that it is possible to stream live microphone input to the REST service (https://learn.microsoft.com/en-us/azure/cognitive-services/speech/home):
Real-time continuous recognition. The speech recognition API enables
users to transcribe audio into text in real time, and supports to
receive the intermediate results of the words that have been
recognized so far.
However, I was not able to find a sample showing how this could be achieved in a cross-platform fashion using Xamarin Forms.
I have found the following tutorial: https://developer.xamarin.com/guides/xamarin-forms/cloud-services/cognitive-services/speech-recognition/
But in that tutorial, the audio sent to the API is an already existing audio file. What I would like to achieve, however, is to stream the microphone input of the device running the app (Android, iOS, UWP).
Any insight would be appreciated.
I am afraid that there are no libraries compatible with Xamarin that support the real-time Microsoft Speech API. The only compatible option is the Bing Speech API, which uses the REST protocol and does not offer real-time transcription.
Real-time transcription requires the Speech Service WebSocket protocol, which is fully documented. You could implement this interface yourself, but doing it reliably may be quite a complex task.
There are, however, native libraries for iOS and Android which do support real-time streaming. You can see the tutorial for iOS and the tutorial for Android.
What you could do then is use Xamarin Binding Libraries to bind the native libraries into your Xamarin project. For the Java library see this tutorial, and for the Objective-C library see this tutorial.
Creating the Objective-C binding in particular might be a daunting task, and it is usually easier to create an Objective-C library that acts as a facade and in turn uses the native library. You will know the interface of your facade library, so you will be able to create the binding more easily. You may also consider asking the Xamarin team to create the binding for you, as they maintain a growing collection of third-party library bindings on GitHub.
I have a cross-platform solution using Bing Speech. I got iOS working but never tested the Android solution.
There is a great library here that should fit your needs:
https://github.com/NateRickard/Xamarin.Cognitive.BingSpeech

How to identify speaker from voice pattern using Microsoft Speech?

I'm using the Microsoft Speech C# API for home automation commands.
I'd like to know if there is a way, or a built-in C# method, to hash voice input and recognize who is speaking, so that the system can reply "Hello Alice" or "Hello Bob" depending on whether Alice or Bob is talking.
EDIT:
The Microsoft Speech API can provide a .wav file of the recording. It might be possible to hash or otherwise process it to work out who is speaking:
Loud voice, slow modulation, ... => Bob
High voice, fast modulation, ... => Alice
Speaker recognition is a hard problem and is still an active research area. I don't think the Microsoft Speech API has any speaker recognition support, but I'm not 100% sure.
I found the following article really helpful while researching the topic. It introduces the subject and also provides a very crude implementation. Probably a good place to start.
http://www.ibm.com/developerworks/opensource/library/os-sndpeek/index.html
You can use Microsoft Speaker Recognition APIs for doing this task: https://www.microsoft.com/cognitive-services/en-us/speaker-recognition-api
Microsoft is providing two APIs for this purpose: Speaker Verification & Speaker Identification.
You can find their C# & Python SDKs here: https://github.com/Microsoft/ProjectOxford-ClientSDK/tree/master/SpeakerRecognition
It looks like you are trying to solve the speaker diarization problem (finding who speaks when); there are many toolkits available on the Internet for that. I can recommend one that runs on Java, called LIUM: http://www-lium.univ-lemans.fr/diarization/doku.php.
If you are just interested in distinguishing Alice and Bob, you can have a look at the Gender Detection part of the Scripting page on the website above (or go directly here: http://www-lium.univ-lemans.fr/diarization/doku.php/gender_detection).
Microsoft Speech also has an SDK for speaker diarization.
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-use-conversation-transcription
It is available in C++/C#/Java and it has dedicated hardware ready to purchase.

C# Speech Recognition from System Audio (Speaker Sound)

I've seen speech recognition from input devices (obviously) and I've seen speech recognition from files (http://gotspeech.net/forums/thread/6835.aspx). However, I was wondering whether it would be possible to run speech recognition on system audio in real time. By system audio, I mean the sound that comes out of your speakers.
It would be a great tool for those who are hard of hearing: while they are watching YouTube videos, the C# application could transcribe what's being said.
How could I go about doing this?
Very easily - Go to the sound mixer, choose input and enable/unmute "Stereo Mix". You should, of course, mute the mic if you don't want to record that too. Then, just start recording the same way you'd record the mic - now you'll get the same feed as the speakers at digital quality.
This can be done programmatically, although it can be fiddly, especially if you want to support WinXP as well as Vista/Win7 (sound was overhauled in Vista and I believe the APIs are significantly different, although I haven't had to use them yet).
You're almost certainly going to need to filter the sound before attempting recognition. Unless the speech recog. library you're using is designed to work in adverse conditions, music and special effects will interfere with proper recognition as will multiple people speaking at the same time.
If you haven't got a super-robust library, filters to attenuate non-vocal frequencies are going to be a must. You may also need to apply volume normalisation to account for loud/quiet scenes - There are hundreds of filters that could potentially improve matching.
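As a rough illustration of the two preprocessing steps mentioned above, the sketch below chains simple one-pole high-pass and low-pass filters into a band-pass and applies peak normalisation. The class name `SpeechPreprocessor` is hypothetical, and the 300 to 3400 Hz band (the traditional telephone voice band) is an assumed starting point, not a tuned value; first-order filters like these only attenuate gently, so a real application would likely need steeper filters.

```csharp
using System;

// Hypothetical helper for cleaning up mono float samples before recognition.
public static class SpeechPreprocessor
{
    // One-pole low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1]), with a = dt / (rc + dt).
    public static float[] LowPass(float[] x, int sampleRate, double cutoffHz)
    {
        double rc = 1.0 / (2 * Math.PI * cutoffHz);
        double dt = 1.0 / sampleRate;
        double a = dt / (rc + dt);
        var y = new float[x.Length];
        double prev = 0;
        for (int i = 0; i < x.Length; i++)
        {
            prev += a * (x[i] - prev);
            y[i] = (float)prev;
        }
        return y;
    }

    // One-pole high-pass: y[n] = a * (y[n-1] + x[n] - x[n-1]), with a = rc / (rc + dt).
    public static float[] HighPass(float[] x, int sampleRate, double cutoffHz)
    {
        double rc = 1.0 / (2 * Math.PI * cutoffHz);
        double dt = 1.0 / sampleRate;
        double a = rc / (rc + dt);
        var y = new float[x.Length];
        double prevX = 0, prevY = 0;
        for (int i = 0; i < x.Length; i++)
        {
            double cur = a * (prevY + x[i] - prevX);
            prevX = x[i];
            prevY = cur;
            y[i] = (float)cur;
        }
        return y;
    }

    // Attenuates content below lowHz and above highHz (assumed voice band by default).
    public static float[] BandPass(float[] x, int sampleRate,
                                   double lowHz = 300, double highHz = 3400)
        => LowPass(HighPass(x, sampleRate, lowHz), sampleRate, highHz);

    // Scales the signal so its largest absolute sample equals targetPeak.
    public static float[] NormalizePeak(float[] x, float targetPeak = 0.9f)
    {
        float peak = 0f;
        foreach (float s in x) peak = Math.Max(peak, Math.Abs(s));
        if (peak == 0f) return (float[])x.Clone();  // silence: nothing to scale

        float gain = targetPeak / peak;
        var y = new float[x.Length];
        for (int i = 0; i < x.Length; i++) y[i] = x[i] * gain;
        return y;
    }
}
```

Normalising a whole buffer by its peak only evens out level differences between buffers; handling loud and quiet scenes within one recording would need a windowed or compressor-style approach.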
You may want to access the recognition API at the lowest level to get as much control as possible - You'll need to tweak it to cope with people shouting, breathless, crying, etc... If you start designing for flexible low-level access, it will probably save you weeks if you find you need it later on and have to re-architect.
I'd suggest you look into NAudio as a starting point for audio processing.
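As a starting point, NAudio can capture what the default output device is playing through WASAPI loopback, which is an alternative to enabling "Stereo Mix" by hand (Vista and later only). The sketch below assumes the NAudio NuGet package is referenced; the output file name is an arbitrary example. It only writes the audio to a WAV file and does not perform any recognition.

```csharp
using System;
using System.Threading;
using NAudio.Wave;

class LoopbackRecorder
{
    static void Main()
    {
        // Captures whatever the default render device (speakers/headphones) is playing.
        var capture = new WasapiLoopbackCapture();
        var writer = new WaveFileWriter("system-audio.wav", capture.WaveFormat);
        var stopped = new ManualResetEventSlim(false);

        // Append each captured buffer to the WAV file as it arrives.
        capture.DataAvailable += (s, e) => writer.Write(e.Buffer, 0, e.BytesRecorded);

        // StopRecording() is asynchronous, so clean up only once the capture has really stopped.
        capture.RecordingStopped += (s, e) =>
        {
            writer.Dispose();
            capture.Dispose();
            stopped.Set();
        };

        capture.StartRecording();
        Console.WriteLine("Recording system audio; press Enter to stop.");
        Console.ReadLine();

        capture.StopRecording();
        stopped.Wait();
    }
}
```

The captured format is whatever the output device uses (often 32-bit float stereo), so it will probably need to be resampled and converted to the format the recognition engine expects before it can be transcribed.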
I suspect you'll be able to get something which works under ideal conditions without too much effort - but tweaking it to work well in all eventualities may be a mammoth task. That said, it sounds like a fun project.
You could improve recognition chance considerably by creating genre-, user- or show-specific dictionaries. These could either be pre-generated, or built automatically using a weighted feedback loop - perhaps also allowing the user to correct mistakes.

How to create a custom sapi voice for tts

I am working on a project for which I need to create a custom voice engine for my application. I have seen tools such as TTS Builder, but does anyone understand how applications like TTS Builder are themselves developed? What is behind SAPI engines? How do they work? How can one construct one's own? Can I develop my own algorithm? I would prefer to do this in C# if possible.
From what I see, it looks like TTS Builder takes existing voices and allows you to tweak minor parameters to make a slightly different-sounding voice. But creating a voice with a different accent or pronunciation I think is more complex.
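To illustrate what adjusting parameters of an existing voice can look like from C#, here is a minimal sketch using the .NET `System.Speech.Synthesis` API (Windows only, requires a reference to System.Speech). It is not a description of how TTS Builder works internally, which the source does not explain; it only lists the installed SAPI voices and speaks with a modified rate and volume.

```csharp
using System;
using System.Speech.Synthesis;  // Windows only; add a reference to System.Speech

class VoiceTweakDemo
{
    static void Main()
    {
        using (var synth = new SpeechSynthesizer())
        {
            // List the SAPI voices installed on this machine.
            foreach (InstalledVoice voice in synth.GetInstalledVoices())
            {
                VoiceInfo info = voice.VoiceInfo;
                Console.WriteLine($"{info.Name} ({info.Culture}, {info.Gender})");
            }

            // Adjust the default voice rather than building a new one.
            synth.Rate = -2;    // valid range is -10 (slowest) to 10 (fastest)
            synth.Volume = 80;  // valid range is 0 to 100

            synth.SetOutputToDefaultAudioDevice();
            synth.Speak("This is an existing voice with an adjusted rate and volume.");
        }
    }
}
```

Changing rate and volume leaves the underlying voice data untouched, which is consistent with the point above that a genuinely different accent or pronunciation takes considerably more work.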
From AT&T Research:
Creating high-quality voices requires a good voice talent, a sound-proof room, professional audio equipment, hours of written material with thorough coverage of phoneme combinations in the language, and the time and expertise to turn those recordings into a decent synthetic voice. Because of the expense involved, custom voice builds are usually done for corporations that want to computerize an existing actor's voice, for example to continue a brand image.
...
It may take far less material to build a transformation model than it does to build a TTS voice from scratch.
