I am writing a little application that is supposed to listen for user commands and send keystrokes to another program. I am using Speech Recognition Engine Class but my script doesn't work properly.
If I use a custom grammar (with very few words like "start" or "exit"), the program will always recognize one of my words even if I say something completely different.
For instance, I say "stackoverflow" and the program recognizes "start".
With the default dictionary the program becomes almost impossible to use (I have to be 100% accurate, otherwise it won't understand me).
The strange thing is that if I use SpeechRecognizer instead of SpeechRecognitionEngine my program works perfectly, but of course every time I say something unrelated it messes things up, because Windows Speech Recognition handles the result, and I don't want that to happen. That is actually the reason I am using SpeechRecognitionEngine.
What am I doing wrong?
Choices c = new Choices(new string[] { "use", "menu", "map", "save", "talk", "esc" });
GrammarBuilder gb = new GrammarBuilder(c);
Grammar g = new Grammar(gb);
sr = new SpeechRecognitionEngine();
sr.LoadGrammar(g);
sr.SetInputToDefaultAudioDevice();
sr.SpeechRecognized += sr_SpeechRecognized;
Almost forgot: I don't know if it's relevant, but I am using Visual Studio 11 Ultimate Beta.
For each speech recognition result you also receive the confidence of the recognition; a low confidence level indicates that the engine is "not so sure" about the result and you might want to reject it, e.g.:
private void SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    if (e.Result.Confidence >= 0.7)
    {
        // high enough confidence, use the result
    }
    else
    {
        // reject the result
    }
}
Related
Below is the speech recognition code I have. I've noticed that it's not that good at picking up sentences. Is there something I can do to fix this?
public String listenForVoice()
{
    SpeechRecognitionEngine recognizer = new SpeechRecognitionEngine(new CultureInfo("en-US"));
    Grammar dictationGrammar = new DictationGrammar();
    recognizer.LoadGrammar(dictationGrammar);
    try
    {
        recognizer.SetInputToDefaultAudioDevice();
        RecognitionResult result = recognizer.Recognize();
        // Recognize() returns null if nothing was recognized
        if (result != null)
            return result.Text;
    }
    catch (InvalidOperationException)
    {
        // no audio input device available
    }
    finally
    {
        recognizer.UnloadAllGrammars();
    }
    return "";
}
First:
To improve the accuracy of SpeechRecognitionEngine, create and load a grammar that closely matches the words/phrases/speech you actually want to recognize. A narrowly scoped grammar improves accuracy.
Second:
Evaluate the hypothesized results and then the recognized confidence levels, and reject anything below a threshold. This is not always practical, because the right threshold differs for each person/user. A minimal sketch of both ideas follows.
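A minimal sketch of both points (the command words simply mirror the grammar above, and the 0.7 threshold is only a starting point you would tune per user): a narrowly scoped grammar plus logging the hypothesized and recognized confidences.
using System;
using System.Speech.Recognition;

class ConfidenceTuning
{
    static void Main()
    {
        var commands = new Choices(new[] { "use", "menu", "map", "save", "talk", "esc" });

        using (var engine = new SpeechRecognitionEngine())
        {
            engine.LoadGrammar(new Grammar(new GrammarBuilder(commands)));
            engine.SetInputToDefaultAudioDevice();

            // Log hypotheses so you can see how confident the engine is while it is still guessing.
            engine.SpeechHypothesized += (s, e) =>
                Console.WriteLine("Hypothesis: {0} ({1:F2})", e.Result.Text, e.Result.Confidence);

            // Only act on final results above a threshold tuned for your users.
            engine.SpeechRecognized += (s, e) =>
            {
                if (e.Result.Confidence >= 0.7)
                    Console.WriteLine("Accepted: {0} ({1:F2})", e.Result.Text, e.Result.Confidence);
                else
                    Console.WriteLine("Rejected: {0} ({1:F2})", e.Result.Text, e.Result.Confidence);
            };

            engine.RecognizeAsync(RecognizeMode.Multiple);
            Console.ReadLine();
        }
    }
}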
There is no way to prevent the .NET SpeechRecognitionEngine from ALWAYS returning a grammar match. You could be saying "bob" in a silent room into a studio-grade mic and it would still recognize "open windows media player". lol
Warning 1: grammar word lists of over 1,000 entries slow things down and can lock up the application.
Warning 2: en-US has good English recognition; switching to en-GB etc. lowers accuracy drastically.
So far I've had the best luck with Google's Speech Recognition API (it does require you to be online), but it is roughly 10x more accurate and you can easily test for a match yourself.
I have been messing around with C# voice-to-text, which has been pretty easy. However, I am trying to figure out how I can detect a free-form (candid) sentence as opposed to one of the preset commands I've made.
Currently I can do various things by listening for keywords:
SpeechRecognitionEngine _recognizer = new SpeechRecognitionEngine();

public Form1()
{
    _recognizer.SetInputToDefaultAudioDevice();
    _recognizer.LoadGrammar(new Grammar(new GrammarBuilder(new Choices(File.ReadAllLines(@"Commands.txt")))));
    _recognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(_recognizer_SpeechRecognized);
    _recognizer.RecognizeAsync(RecognizeMode.Multiple);
}
void _recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    // magic happens here
}
This works great, as I said before. I've tried some of the other events associated with speech recognition, such as SpeechHypothesized, but they only guess words from the grammar loaded into the program, which is the preset commands. This makes sense. However, if I load a full dictionary into the grammar, then my commands will be recognized much less accurately.
I am trying to set up the program so that when a keyword is said (one of the commands), it will then listen and try to transcribe an actual sentence.
I was looking for "speech dictation." Instead of loading my command list into the grammar, I was able to use the DictationGrammar built into System.Speech in order to listen for a complete sentence.
SpeechRecognitionEngine recognitionEngine = new SpeechRecognitionEngine();
recognitionEngine.SetInputToDefaultAudioDevice();
recognitionEngine.LoadGrammar(new DictationGrammar());

// listen for up to 20 seconds
RecognitionResult result = recognitionEngine.Recognize(new TimeSpan(0, 0, 20));
if (result != null)
{
    foreach (RecognizedWordUnit word in result.Words)
    {
        Console.Write("{0} ", word.Text);
    }
}
This method doesn't seem very accurate, though; even if I check word.Confidence, it only guesses the correct word less than half the time.
I may try the Google Voice API and send a FLAC file for post-processing.
I am using C# and Windows speech recognition to communicate with my program. The only word that needs to be recognized is "Yes", and this works fine. The only problem is that once speech recognition is activated it types in whatever I am saying. Is there a way to limit the speech recognition to a single word, in this case "yes"?
Thank you
What do you mean by "since the speech recognition is activated it will type in whatever I am saying"? Are you saying that the desktop recognizer continues to run and handle commands? Perhaps you should be using an in-process recognizer (SpeechRecognitionEngine) rather than the shared recognizer (see Using System.Speech.Recognition opens Windows Speech Recognition).
Are you using a dictation grammar? If you only want to recognize a limited set of words or commands, do not use the dictation grammar. Use a GrammarBuilder (or similar) and create a simple grammar. See http://msdn.microsoft.com/en-us/library/hh361596
There is a very good article that was published a few years ago at http://msdn.microsoft.com/en-us/magazine/cc163663.aspx. It is probably the best introductory article I've found so far. It is a little out of date, but very helpful. (The AppendResultKeyValue method was dropped after the beta.) Look at the examples of how they build the grammars for ordering pizza.
One thing to keep in mind: a grammar with one word may produce many false positives (since the recognizer will try to match to something in your grammar). You may want to put in at least Yes and No so it has something to compare to (see the sketch below).
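Putting those pieces together, a minimal sketch (the 0.7 threshold and the lambda handler are illustrative, not part of the original question) using an in-process SpeechRecognitionEngine with a yes/no grammar:
using System;
using System.Speech.Recognition;

class YesNoListener
{
    static void Main()
    {
        // In-process recognizer, so the shared Windows Speech Recognition UI stays out of the way.
        using (var engine = new SpeechRecognitionEngine())
        {
            // "no" is only there to give the recognizer something to compare against.
            engine.LoadGrammar(new Grammar(new GrammarBuilder(new Choices("yes", "no"))));
            engine.SetInputToDefaultAudioDevice();

            engine.SpeechRecognized += (s, e) =>
            {
                if (e.Result.Text == "yes" && e.Result.Confidence >= 0.7)
                    Console.WriteLine("Got a yes");
            };

            engine.RecognizeAsync(RecognizeMode.Multiple);
            Console.ReadLine();
        }
    }
}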
If your code is similar to the following:
SpeechRecognitionEngine recognitionEngine = new SpeechRecognitionEngine();
recognitionEngine.SetInputToDefaultAudioDevice();
recognitionEngine.SpeechRecognized += (s, args) =>
{
    foreach (RecognizedWordUnit word in args.Result.Words)
    {
        Console.WriteLine(word.Text);
    }
};
recognitionEngine.LoadGrammar(new DictationGrammar());
// start listening continuously
recognitionEngine.RecognizeAsync(RecognizeMode.Multiple);
Just use an if statement:
foreach (RecognizedWordUnit word in args.Result.Words)
{
    if (word.Text == "yes")
        Console.WriteLine(word.Text);
}
Note that recognitionEngine.SpeechRecognized is an event that fires whenever something is recognized, and the handler can be hooked up in other ways, such as:
recognitionEngine.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(recognizer_SpeechRecognized);
// this method is static because I called it from a console Main method; it can be changed
static void recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    Console.WriteLine(e.Result.Text);
}
My examples are in a console app, but it works the same for a GUI.
I've started using the .NET speech-to-text library (SpeechRecognizer).
While Googling and searching this site I found this code sample:
var c = new Choices();
for (var i = 0; i <= 100; i++)
c.Add(i.ToString());
var gb = new GrammarBuilder(c);
var g = new Grammar(gb);
rec.UnloadAllGrammars();
rec.LoadGrammar(g);
rec.Enabled = true;
This helped me get started. I changed these two lines
for (var i = 0; i <= 100; i++)
c.Add(i.ToString());
to fit my needs:
c.Add("Open");
c.Add("Close");
But when I say 'Close', the Windows speech recognizer closes my application!
In addition, is there a better way to recognize speech than creating my own dictionary? I would like the user to say something like "Write a note to myself", and then the user will speak and I'll write it down.
Sorry for asking two questions in the same post; both seem to be relevant to the same problem.
You are using the shared speech recognizer (SpeechRecognizer). When you instantiate SpeechRecognizer you get a recognizer that can be shared by other applications; it is typically used for building applications that control Windows and the applications running on the desktop.
It sounds like you want to use your own private recognition engine (SpeechRecognitionEngine). So instantiate a SpeechRecognitionEngine instead.
see SpeechRecognizer Class.
Disable built-in speech recognition commands? may also have some helpful info.
Microsoft's desktop recognizers include a special grammar called a dictation grammar that can be used to transcribe arbitrary words spoken by the user. You can use the dictation grammar to do transcription-style recognition. See DictationGrammar Class and SAPI and Windows 7 Problem.
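A rough sketch of that setup, assuming the same Open/Close commands from the question (the handler body is just illustrative):
using System;
using System.Speech.Recognition;

class PrivateRecognizerDemo
{
    static void Main()
    {
        // SpeechRecognitionEngine is private to this process, so saying "Close"
        // no longer triggers the shared Windows recognizer's window commands.
        using (var rec = new SpeechRecognitionEngine())
        {
            rec.LoadGrammar(new Grammar(new GrammarBuilder(new Choices("Open", "Close"))));

            // A dictation grammar can be loaded alongside it for free-form phrases
            // such as "Write a note to myself".
            rec.LoadGrammar(new DictationGrammar());

            rec.SetInputToDefaultAudioDevice();
            rec.SpeechRecognized += (s, e) =>
                Console.WriteLine("{0} ({1:F2})", e.Result.Text, e.Result.Confidence);

            rec.RecognizeAsync(RecognizeMode.Multiple);
            Console.ReadLine();
        }
    }
}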
I have a better answer....
Try adding a DictationGrammar to your recognizer... it seems to disable all the built-in commands like select/delete/close, etc.
You then need to use the SpeechRecognized event and SendKeys to add text to the page. My findings so far indicate that you can't have your SAPI cake and eat it too.
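A hedged sketch of that approach, assuming a WinForms context so SendKeys is available (how you target the destination window is up to you):
using System.Speech.Recognition;
using System.Windows.Forms;

class DictationToKeystrokes
{
    private readonly SpeechRecognizer _recognizer = new SpeechRecognizer();

    public void Start()
    {
        // Loading a dictation grammar into the shared recognizer appears to stop
        // the built-in select/delete/close commands from firing.
        _recognizer.LoadGrammar(new DictationGrammar());

        _recognizer.SpeechRecognized += (s, e) =>
        {
            // Forward the recognized text as keystrokes to whichever window has focus.
            SendKeys.SendWait(e.Result.Text + " ");
        };

        _recognizer.Enabled = true;
    }
}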
I think the solution above should work for you if you've not already solved it (or moved on).
I'm trying to use the SpeechRecognizer with a custom Grammar to handle the following pattern:
"Can you open {item}?" where {item} uses DictationGrammar.
I'm using the speech engine built into Vista and .NET 4.0.
I would like to be able to get the confidences for the SemanticValues returned. See example below.
If I simply use "recognizer.AddGrammar( new DictationGrammar() )", I can browse through e.Results.Alternates and view the confidence values of each alternate. That works if DictationGrammar is at the top level.
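For reference, browsing the alternates looks roughly like this inside the SpeechRecognized handler (a sketch):
foreach (RecognizedPhrase alternate in e.Result.Alternates)
{
    // each alternate carries its own overall confidence
    Console.WriteLine("{0} ({1:F2})", alternate.Text, alternate.Confidence);
}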
Made up example:
Can you open Firefox? .95
Can you open Fairfax? .93
Can you open file fax? .72
Can you pen Firefox? .85
Can you pin Fairfax? .63
But if I build a grammar that looks for "Can you open {semanticValue Key='item' GrammarBuilder=new DictationGrammar()}?", then I get this:
Can you open Firefox? .91 - Semantics = {GrammarBuilder.Name = "can you open"}
Can you open Fairfax? .91 - Semantics = {GrammarBuilder.Name = "can you open"}
Can you open file fax? .91 - Semantics = {GrammarBuilder.Name = "can you open"}
Can you pen Firefox? .85 - Semantics = null
Can you pin Fairfax? .63 - Semantics = null
The .91 shows me how confident it is that the input matched the pattern "Can you open {item}?", but it doesn't distinguish any further.
However, if I then look at e.Result.Alternates.Semantics.Where( s => s.Key == "item" ), and view their Confidence, I get this:
Firefox 1.0
Fairfax 1.0
file fax 1.0
Which doesn't help me much.
What I really want is something like this when I view the Confidence of the matching SemanticValues:
Firefox .95
Fairfax .93
file fax .85
It seems like it should work that way...
Am I doing something wrong? Is there even a way to do that within the Speech framework?
I'm hoping there's some inbuilt mechanism so that I can do it the "right" way.
As for another approach that will probably work (a rough sketch is below):
1. Use the SemanticValue approach to match on the pattern.
2. For anything that matches that pattern, extract the raw audio for {item} (use RecognitionResult.Words and RecognitionResult.GetAudioForWordRange).
3. Run the raw audio for {item} through a SpeechRecognizer with the DictationGrammar to get the Confidence.
... but that's more processing than I really want to do.
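A rough sketch of that audio-extraction idea, assuming the {item} words are the ones following the "Can you open" preamble (the word index and the helper name are hypothetical):
// Hypothetical helper: re-run the audio of the {item} words through a
// dictation-only SpeechRecognitionEngine to get a per-item confidence.
private float GetItemConfidence(RecognitionResult result)
{
    // Assumption: words 0..2 are "can you open", the rest are the item.
    RecognizedWordUnit first = result.Words[3];
    RecognizedWordUnit last = result.Words[result.Words.Count - 1];
    RecognizedAudio itemAudio = result.GetAudioForWordRange(first, last);

    using (var stream = new System.IO.MemoryStream())
    {
        itemAudio.WriteToWaveStream(stream);
        stream.Position = 0;

        using (var dictation = new SpeechRecognitionEngine())
        {
            dictation.LoadGrammar(new DictationGrammar());
            dictation.SetInputToWaveStream(stream);

            RecognitionResult itemResult = dictation.Recognize();
            return itemResult != null ? itemResult.Confidence : 0f;
        }
    }
}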
I think a dictation grammar only does transcription: it does speech-to-text without extracting semantic meaning, because by definition a dictation grammar supports all words and has no clues about your specific semantic mapping. You need a custom grammar to extract semantic meaning. If you supply an SRGS grammar, or build one in code or with the SpeechServer tools, you can specify semantic mappings for certain words and phrases. Then the recognizer can extract semantic meaning and give you a semantic confidence.
You should be able to get a Confidence value from the recognizer on the overall recognition; try System.Speech.Recognition.RecognitionResult.Confidence.
The help file that comes with the Microsoft Server Speech Platform 10.2 SDK has more details. (This is the Microsoft.Speech API for server applications, which is very similar to the System.Speech API for client applications.) See http://www.microsoft.com/downloads/en/details.aspx?FamilyID=1b1604d3-4f66-4241-9a21-90a294a5c9a4 or the Microsoft.Speech documentation at http://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.semanticvalue(v=office.13).aspx
For the SemanticValue Class it says:
All Speech platform-based recognition engine output provides valid instances of SemanticValue for all recognized output, even phrases with no explicit semantic structure.
The SemanticValue instance for a phrase is obtained using the Semantics property on the RecognizedPhrase object (or objects which inherit from it, such as RecognitionResult).
SemanticValue objects obtained for recognized phrases without semantic structure are characterized by:
- having no children (Count is 0)
- the Value property is null
- an artificial confidence level of 1.0 (returned by Confidence)
Typically, applications create instances of SemanticValue indirectly, adding them to Grammar objects by using SemanticResultValue and SemanticResultKey instances in conjunction with Choices and GrammarBuilder objects. Direct construction of a SemanticValue is useful during the creation of strongly typed grammars.
When you use the SemanticValue features in the grammar, you are typically trying to map different phrases to a single meaning. In your case the phrases "I.E" and "Internet Explorer" should both map to the same semantic meaning. You set up choices in your grammar to understand each phrase that can map to a specific meaning. Here is a simple WinForms example:
private void btnTest_Click(object sender, EventArgs e)
{
    SpeechRecognitionEngine myRecognizer = new SpeechRecognitionEngine();
    Grammar testGrammar = CreateTestGrammar();
    myRecognizer.LoadGrammar(testGrammar);

    // use the microphone
    try
    {
        myRecognizer.SetInputToDefaultAudioDevice();
        WriteTextOuput("");
        RecognitionResult result = myRecognizer.Recognize();

        string item = null;
        float confidence = 0.0F;

        // Recognize() returns null if nothing was recognized
        if (result != null && result.Semantics.ContainsKey("item"))
        {
            item = result.Semantics["item"].Value.ToString();
            confidence = result.Semantics["item"].Confidence;
            WriteTextOuput(String.Format("Item is '{0}' with confidence {1}.", item, confidence));
        }
    }
    catch (InvalidOperationException exception)
    {
        WriteTextOuput(String.Format("Could not recognize input from default audio device. Is a microphone or sound card available?\r\n{0} - {1}.", exception.Source, exception.Message));
        myRecognizer.UnloadAllGrammars();
    }
}
private Grammar CreateTestGrammar()
{
    // item choices, each mapped to a semantic value
    Choices item = new Choices();
    SemanticResultValue itemSRV;

    itemSRV = new SemanticResultValue("I E", "explorer");
    item.Add(itemSRV);
    itemSRV = new SemanticResultValue("explorer", "explorer");
    item.Add(itemSRV);
    itemSRV = new SemanticResultValue("firefox", "firefox");
    item.Add(itemSRV);
    itemSRV = new SemanticResultValue("mozilla", "firefox");
    item.Add(itemSRV);
    itemSRV = new SemanticResultValue("chrome", "chrome");
    item.Add(itemSRV);
    itemSRV = new SemanticResultValue("google chrome", "chrome");
    item.Add(itemSRV);

    SemanticResultKey itemSemKey = new SemanticResultKey("item", item);

    // build the permutations of choices...
    GrammarBuilder gb = new GrammarBuilder();
    gb.Append(itemSemKey);

    // now build the complete pattern...
    GrammarBuilder itemRequest = new GrammarBuilder();

    // preamble: "Can you open" / "Open" / "Please open"
    itemRequest.Append(new Choices("Can you open", "Open", "Please open"));

    // the item is optional (0 or 1 occurrences)
    itemRequest.Append(gb, 0, 1);

    Grammar TestGrammar = new Grammar(itemRequest);
    return TestGrammar;
}