I'm using the Microsoft Speech C# API for home automation commands.
I'd like to know whether there is a way, or a built-in C# method, to hash voice input and recognize who is speaking, so that the system can say "Hello Alice" or "Hello Bob" depending on whether Alice or Bob is talking.
EDIT:
The Microsoft Speech API can provide a .wav file of the recording. It might be possible to hash or process that file to work out who is speaking:
Loud voice, slow modulation, ... => Bob
High voice, fast modulation, ... => Alice
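The two cues listed above (loudness and pitch) can be computed directly from the decoded samples of the .wav file. The following is a minimal C++ sketch, assuming a mono buffer of samples normalized to [-1, 1]; the function names and the 60-400 Hz speech range are illustrative assumptions, not part of any Microsoft API. Features this crude can at best separate two very different voices and are not real speaker recognition.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square level of a mono buffer (samples assumed in [-1, 1]).
double rmsLevel(const std::vector<double>& samples) {
    if (samples.empty()) return 0.0;
    double sum = 0.0;
    for (double s : samples) sum += s * s;
    return std::sqrt(sum / static_cast<double>(samples.size()));
}

// Crude pitch estimate in Hz: pick the lag with the highest (unnormalized)
// autocorrelation inside an assumed speech range. Returns 0 if the buffer
// is too short or no positive correlation is found.
double estimatePitchHz(const std::vector<double>& samples, double sampleRate,
                       double minHz = 60.0, double maxHz = 400.0) {
    const std::size_t minLag = static_cast<std::size_t>(sampleRate / maxHz);
    const std::size_t maxLag = static_cast<std::size_t>(sampleRate / minHz);
    if (minLag == 0 || samples.size() <= maxLag) return 0.0;

    double bestScore = 0.0;
    std::size_t bestLag = 0;
    for (std::size_t lag = minLag; lag <= maxLag; ++lag) {
        double score = 0.0;
        for (std::size_t i = 0; i + lag < samples.size(); ++i)
            score += samples[i] * samples[i + lag];
        if (score > bestScore) {
            bestScore = score;
            bestLag = lag;
        }
    }
    return bestLag == 0 ? 0.0 : sampleRate / static_cast<double>(bestLag);
}
```

A caller could compare `rmsLevel` and `estimatePitchHz` against thresholds measured beforehand from recordings of each person, keeping in mind that loudness also depends on distance to the microphone.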
Speaker recognition is a hard problem and is still an active research area. I don't think the Microsoft Speech API has any speaker recognition support, but I'm not 100% sure.
I found the following article really helpful while researching the topic. It introduces the subject and also provides a very crude implementation. Probably a good place to start.
http://www.ibm.com/developerworks/opensource/library/os-sndpeek/index.html
You can use the Microsoft Speaker Recognition APIs for this task: https://www.microsoft.com/cognitive-services/en-us/speaker-recognition-api
Microsoft provides two APIs for this purpose: Speaker Verification and Speaker Identification.
You can find their C# & Python SDKs here: https://github.com/Microsoft/ProjectOxford-ClientSDK/tree/master/SpeakerRecognition
It looks like you are trying to solve the speaker diarization problem (finding who speaks when); there are many toolkits available on the Internet for that. I can recommend one (it runs on Java) called LIUM: http://www-lium.univ-lemans.fr/diarization/doku.php.
If you are just interested in distinguishing Alice and Bob, you can have a look at the Gender Detection part in the Scripting page of the website above (or go directly here: http://www-lium.univ-lemans.fr/diarization/doku.php/gender_detection).
Microsoft Speech also has an SDK for speaker diarization.
https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-use-conversation-transcription
It is available in C++, C#, and Java, and dedicated hardware for it is available for purchase.
Related
I'm considering porting a speech-based 2D HTML5 web game I've built to Unity2D for iPhone and Android. I'm a full-stack web developer, not a Unity developer, so an agency would help me build the Unity app. Before signing with them, I need to be sure both Speech to Text (STT) and Text to Speech (TTS) services are available for Mandarin, Spanish, and English; otherwise I'd waste a lot of money up front.
For Web, Webkit Speech (STT Docs, STT Demo, TTS Docs, TTS Demo) is easily accessible via the browser. I've found that IBM Watson has an API available, and has demos for STT and TTS, and I've found that they have a Unity SDK here, but I don't have the skillsets to test the Unity SDK.
I'm looking for guidance on great STT and TTS APIs that the agency can use for those three foreign languages.
Does the Unity SDK provide support for frontend STT and TTS audio streaming? STT needs to capture users' voice input and transcribe it quickly. Likewise, TTS needs to allow the user to hover over a target language word and listen to a near-native pronunciation.
Does it offer both STT and TTS for Spanish, Mandarin, and English?
What other NLP APIs are there which meet my requirements?
Apologies, I'm completely new to Unity/phone development, so any guidance here would be extremely helpful. If no APIs exist that meet these requirements, then Unity won't work for my app, since STT and TTS are critical.
Overall, realtime audio recording in Unity is awful; the system is simply not designed to record audio continuously. You can record a clip with AudioSource, but that is a clip of fixed length, not a streaming solution.
For streaming, you can get the audio with OnAudioFilterRead, but it is not really an API for recording; it is meant more for effects. For recording, it has unpredictable latency and also slows down the UI significantly.
As a result, you can only have a push-to-talk kind of interaction, not a realtime one.
If you have other alternatives, you had better consider them too. For example, you could consider a native app.
I searched on Google, but I did not find much information about it. I was wondering if anyone has experience with this and knows a proper way to capture input from a microphone and play it back. What I would like to build is a typical streaming app in C#, which takes audio from the microphone and sends it to the client application. I await advice, thank you.
There is plenty of source code available if you search Google or Bing. If you want to build this application in C#, you need to know some basics of network programming in C#.
If you want to build a program like a voice chat, you will need to grab the audio from the microphone using a technology such as DirectSound and send it over the network, for example as UDP packets.
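To illustrate the UDP part of a voice chat, captured PCM audio is typically cut into small fixed-size chunks, each tagged with a sequence number so the receiver can notice lost or reordered packets. Below is a minimal C++ sketch of that chunking step only; `AudioPacket` and `packetize` are hypothetical names, the capture and socket code are left out, and 16-bit mono PCM is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// One datagram's worth of audio. The layout is illustrative, not a standard.
struct AudioPacket {
    std::uint32_t sequence;             // lets the receiver detect loss or reordering
    std::vector<std::int16_t> samples;  // 16-bit PCM payload
};

// Split a captured PCM buffer into packets of at most samplesPerPacket samples.
// The last packet may be shorter. Returns an empty list if samplesPerPacket is 0.
std::vector<AudioPacket> packetize(const std::vector<std::int16_t>& pcm,
                                   std::size_t samplesPerPacket,
                                   std::uint32_t firstSequence = 0) {
    std::vector<AudioPacket> packets;
    if (samplesPerPacket == 0) return packets;

    std::uint32_t sequence = firstSequence;
    for (std::size_t offset = 0; offset < pcm.size(); offset += samplesPerPacket) {
        const std::size_t count = std::min(samplesPerPacket, pcm.size() - offset);
        const auto first = pcm.begin() + static_cast<std::ptrdiff_t>(offset);
        const auto last = first + static_cast<std::ptrdiff_t>(count);
        packets.push_back({sequence++, std::vector<std::int16_t>(first, last)});
    }
    return packets;
}
```

For example, at a 16 kHz sample rate, 320 samples per packet corresponds to 20 ms of audio per datagram; smaller packets lower the delay but add per-packet overhead.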
If you want to build a video streaming application, there are several ways to do video streaming/conferencing with .NET fairly easily:
plain Windows Media Encoder components, RTC clients, voice/SIP, sockets, etc.
So you have a wide choice of managed technologies here. Another option is Live Meeting, which you may not have had a chance to look at properly yet.
For those still interested I found the NAudio library really interesting: https://github.com/naudio/NAudio
I am making a flashlight application in Unity with C#. The application is almost complete; I just want to add a voice command feature, so that when I say "ON" the flashlight turns on and when I say "OFF" it turns off. The application is for Android devices. I have seen several tutorials about calling functions on voice commands, but they were all for the Windows platform only. Please help if you know how to do this on Android. Thanks.
I have not used any speech recognition tools, but it's not very difficult to implement if you can create a Java plugin and use it to call native functions. Anyway, I have found a few SDKs:
You can check out the PocketSphinx demos for speech recognition.
https://github.com/cmusphinx/pocketsphinx
https://github.com/cmusphinx/pocketsphinx-android-demo
Here is a repo I found which uses AndroidSpeechRecognition.
https://github.com/gsssrao/UnityAndroidSpeechRecognition
Programmer has given a nice explanation of implementing voice recognition natively:
How to add Speech Recognition to Unity project?
Then there is the Watson SDK for Unity. It seems to be cloud-based, but you can check it out:
https://github.com/watson-developer-cloud/unity-sdk
And if you don't mind paying, there is a plugin called Android SpeakNow that you can grab from the Asset Store:
https://assetstore.unity.com/packages/tools/integration/android-speaknow-16781
These are some cloud-based packages from the Asset Store. I doubt you will need them for this, but they are listed here for anyone who may require them at some point:
https://assetstore.unity.com/packages/add-ons/machinelearning/google-cloud-speech-recognition-vr-ar-desktop-desktop-72625
https://assetstore.unity.com/packages/tools/integration/yandex-cloud-speech-recognition-vr-ar-mobile-desktop-75155
And finally there is DictationRecognizer; as of Unity 2018.2 it is available only on Windows 10 by default, so it is out of the question. My best bet would be CMUSphinx or a native implementation, which I believe would be more suitable for your needs. Check them out, try to implement one or two, and let us know whether you were successful.
If anyone can add more links to SDKs for voice recognition, feel free to add them. That would be really great.
If you only need ON and OFF voice inputs, you can use the following code:
Speech to text in unity
If you need exact speech recognition, then refer to the following code:
Speech recognition in unity
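Whichever recognizer is used, the ON/OFF case comes down to mapping the recognized text to a command. Here is a minimal sketch of that mapping in C++, assuming the recognizer hands back a plain transcript string; `FlashlightCommand` and `parseCommand` are hypothetical names, and the last keyword heard wins.

```cpp
#include <cctype>
#include <sstream>
#include <string>

enum class FlashlightCommand { None, On, Off };

// Map a recognized transcript to a command. Matching is case-insensitive and
// works on whole words, so "button" does not trigger "on".
FlashlightCommand parseCommand(const std::string& transcript) {
    // Lower-case letters and turn everything else into spaces.
    std::string lowered;
    for (unsigned char c : transcript)
        lowered += std::isalpha(c) ? static_cast<char>(std::tolower(c)) : ' ';

    std::istringstream words(lowered);
    std::string word;
    FlashlightCommand result = FlashlightCommand::None;
    while (words >> word) {
        if (word == "on") result = FlashlightCommand::On;
        else if (word == "off") result = FlashlightCommand::Off;
    }
    return result;
}
```

Matching whole words rather than substrings avoids false triggers from longer words that merely contain "on" or "off".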
I have a professional sound card, and I want to record the signal from a guitar with C++ or C# in order to develop real-time guitar effects.
How can I record in real time through a C++ method?
Does that mean I need the sound card's API?
Is this one enough?
Although it may not be as easy as using a pre-built library, you may be able to get a C++ SDK for your sound card from the manufacturer. I would start by browsing their site or contacting support.
If that isn't an option, you can also use DirectSound which is part of the DirectX family of products. The learning curve is fairly steep but I believe it should do just about anything you want.
One final option is to look at a favorite tool (such as Sound Forge). A number of these tools support automation, which means you can click through the app, decide what you want, then automate that sequence of events (see this as an example).
Hope that helps, best of luck!
Side note: I have developed a number of hardware interfaces, and in my experience it's best to start with an example that does at least something like what you are looking for, then modify the code from there. If a particular option doesn't have an example like this, I would probably skip it in favor of one that does.
Examples
DirectSound - Microsoft has a learning site for DirectSound, which you can find here. I also found this blog article, which has an example of recording audio with DirectSound.
Sound Forge - If you download the "Script Developers Kit" there are examples for C# in the scripts folder that should get you started. I believe this particular tool is more focused on editing and effects but I am guessing there should be automation for recording.
To just record audio in real time, any API will be fine. Note that WASAPI is the primary API (since Vista), and legacy APIs such as the waveIn API and DirectSound are implemented on top of WASAPI as compatibility layers.
Regular APIs assume you are okay with a certain processing latency/overhead, on the order of tens of milliseconds.
If you need to be faster than this and want real-time performance, for example to process data and return it for playback as soon as possible, you need so-called exclusive mode streams, where you can achieve latencies on the order of a few milliseconds, which is on par with professional audio development kits.
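The latency figures above follow directly from buffer size and sample rate: a buffer of N frames at a given rate takes N / rate seconds to fill. The helpers below are a minimal C++ sketch of that arithmetic; the function names are hypothetical and nothing here calls WASAPI. Real round-trip latency also includes driver and hardware delays on both the capture and playback side.

```cpp
#include <cmath>

// Time in milliseconds needed to fill (or play out) one buffer of `frames`
// audio frames at the given sample rate.
double bufferLatencyMs(unsigned frames, double sampleRateHz) {
    return 1000.0 * static_cast<double>(frames) / sampleRateHz;
}

// Largest buffer size, in frames, that stays within a latency budget.
unsigned framesForLatency(double latencyMs, double sampleRateHz) {
    return static_cast<unsigned>(std::floor(latencyMs * sampleRateHz / 1000.0));
}
```

For instance, a 480-frame buffer at 48 kHz already costs 10 ms per buffer, while a 128-frame buffer at the same rate costs under 3 ms, which is the range the exclusive mode streams mentioned above are aimed at.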
Windows SDK has a few audio recording samples in \Samples\multimedia\audio (C++)
It's probably a good idea to use a third party library for that.
There's a multitude of options. The ones I know of are PortAudio and STK.
I like the FMOD API, which supports recording (Sound recording with FMOD library) and realtime effects.
I am working on a project in which I need to create a custom voice engine for my application. I have seen tools like TTS Builder, but is there someone who understands how applications such as TTS Builder itself are developed? What is behind SAPI engines? How do they work? How can one construct one's own? Can I develop my own algorithm? I would prefer to do this in C# if possible.
From what I see, it looks like TTS Builder takes existing voices and allows you to tweak minor parameters to make a slightly different-sounding voice. But creating a voice with a different accent or pronunciation I think is more complex.
From AT&T Research:
Creating high-quality voices requires a good voice talent, a sound-proof room, professional audio equipment, hours of written material with thorough coverage of phoneme combinations in the language, and the time and expertise to turn those recordings into a decent synthetic voice. Because of the expense involved, custom voice builds are usually done for corporations that want to computerize an existing actor's voice, for example to continue a brand image.
...
It may take far less material to build a transformation model than it does to build a TTS voice from scratch.