Limiting length of each "line" in Azure's Speech Translation - c#

I am using this code example from Azure's speech translation (in C#) to build a multi-language subtitler for Zoom calls.
https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/quickstart/csharp/dotnet/translate-speech-to-text/helloworld/Program.cs
It's been ages since I've done any coding so I'm trying to get back into it, but what I can't work out is whether there is a way to change how the speech recognizer splits lines. At the moment it waits for a couple of seconds of silence before finalising a result. I would like it to keep doing that, but also force a line break after a set time, say five seconds or so, if the person keeps speaking for longer.
Is that possible, does anyone know?
Very sorry if this is a stupid question, I promise I have looked for myself but can't find the right words.

Thanks for asking this. This may not be the ultimate answer to everything you need, but hopefully it will help.
There is a service property you can set that will cause the intermediate results, delivered through the Recognizing event, to be more "complete" and not get replaced as the recognition continues.
Here is a reference to the properties available: https://learn.microsoft.com/en-us/javascript/api/microsoft-cognitiveservices-speech-sdk/queryparameternames?view=azure-node-latest
You set the property on the config object like so:
speechTranslationConfig.SetServiceProperty("stableIntermediateThreshold", "3", ServicePropertyChannel.UriQueryParameter);
There is also a property you can set to keep the translation output stable as well:
speechTranslationConfig.SetServiceProperty("stableTranslation", "true", ServicePropertyChannel.UriQueryParameter);
You might have to play around with the values to find the right threshold, and some languages may have issues if their word order can change drastically before the end of a sentence. Japanese is a good example of a language where you might want to use only the final recognition.
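As for forcing a break after about five seconds of continuous speech: as far as I know there is no built-in "maximum line duration" setting, so you would combine those service properties with a small client-side timer in the Recognizing handler. Below is a minimal, untested sketch based on the quickstart's setup; the maxLineDuration, lineStart and FlushLine names are my own, and a real subtitler would also need to track which part of the growing intermediate text it has already displayed.

using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Translation;

class Subtitler
{
    // Illustrative cutoff for this sketch; tune to taste.
    static readonly TimeSpan maxLineDuration = TimeSpan.FromSeconds(5);
    static DateTime lineStart = DateTime.UtcNow;

    static async Task Main()
    {
        var config = SpeechTranslationConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
        config.SpeechRecognitionLanguage = "en-US";
        config.AddTargetLanguage("de");

        // The two service properties discussed above.
        config.SetServiceProperty("stableIntermediateThreshold", "3", ServicePropertyChannel.UriQueryParameter);
        config.SetServiceProperty("stableTranslation", "true", ServicePropertyChannel.UriQueryParameter);

        using var recognizer = new TranslationRecognizer(config);

        recognizer.Recognizing += (s, e) =>
        {
            // Intermediate results: force a break if the current line has
            // been growing for longer than maxLineDuration.
            if (DateTime.UtcNow - lineStart > maxLineDuration)
            {
                FlushLine(e.Result.Text);
                lineStart = DateTime.UtcNow;
            }
        };

        recognizer.Recognized += (s, e) =>
        {
            // A pause ended the utterance: flush and start a new line.
            // e.Result.Translations["de"] would hold the translated text.
            if (e.Result.Reason == ResultReason.TranslatedSpeech)
            {
                FlushLine(e.Result.Text);
                lineStart = DateTime.UtcNow;
            }
        };

        await recognizer.StartContinuousRecognitionAsync();
        Console.WriteLine("Speak; press Enter to stop.");
        Console.ReadLine();
        await recognizer.StopContinuousRecognitionAsync();
    }

    static void FlushLine(string text) => Console.WriteLine(text);
}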

Related

Implement Language Auto-Completion based on ANTLR4 Grammar

I am wondering if there are any examples (I haven't found any by googling) of TAB auto-completion for a command line interface (console) that uses ANTLR4 grammars to predict the next term (as in a REPL).
I've written a PL/SQL grammar for an open source database, and now I would like to implement a command line interface to the database that lets the user complete statements according to the grammar, or, where applicable, discover the proper database object name to use (e.g. a table name, a trigger name, a column name, etc.).
Thanks for pointing me in the right direction.
Actually it is possible (depending, of course, on the complexity of your grammar). The problem with auto-completion and ANTLR is that you do not have a complete expression, yet you want to parse it. If you had a complete expression, it would not be a big problem to know what kind of element sits at which place and what can be used there. But you do not have a complete expression, and you cannot parse an incomplete one. So what you need to do is wrap the input in some wrapper/helper that completes the expression to create a parseable one. Notice that nothing added just to complete the expression matters to you - you only ask for members up to the last character the user actually typed.
So:
A) Create a wrapper that will change this (Excel formula) '=If(' into '=If()'
B) Parse the wrapped input
C) Realize that you are in the IF function at the first parameter
D) Return all that can go into that place.
It actually works; I have built IntelliSense-style editors for several simple languages this way. There is much more infrastructure around it, but the basic idea is as described. Just be careful: writing the wrapper is not easy, and can be close to impossible if the grammar is really complex. In that case have a look at the Papa Carlo project: http://lakhin.com/projects/papa-carlo/
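To make the wrapper idea concrete, here is a toy sketch in C# (not tied to any real grammar) that just balances brackets so a partial Excel-style formula becomes parseable. Real wrappers need far more than this; it only illustrates step A.

using System.Collections.Generic;
using System.Text;

static class ExpressionWrapper
{
    // Completes a partial input just enough to make it parseable by
    // appending the missing closing brackets ('=If(' becomes '=If()').
    public static string Complete(string partialInput)
    {
        var open = new Stack<char>();
        foreach (var c in partialInput)
        {
            if (c == '(' || c == '{' || c == '[')
                open.Push(c);
            else if ((c == ')' || c == '}' || c == ']') && open.Count > 0)
                open.Pop();
        }

        var sb = new StringBuilder(partialInput);
        while (open.Count > 0)
            sb.Append(open.Pop() switch { '(' => ')', '{' => '}', _ => ']' });

        // Feed the result to the parser, then inspect the tree only up to
        // the last character the user actually typed.
        return sb.ToString();
    }
}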
As already mentioned, auto-completion is based on the follow set at a given position, simply because that is what the grammar defines as valid language. But that's only a small part of the task. What you need is context (as Sam Harwell wrote: it's a semantic process, not a syntactic one), and that information is independent of the parser. Since a parser is made to parse valid input (and during auto-completion the input is invalid most of the time), it's not the right tool for this task.
Knowing which token can follow at a given position is useful to control the overall process (e.g. you don't want to show suggestions if only a string can appear), but it is usually not what you actually want to suggest (except for keywords). If an ID is possible at the current position, that doesn't tell you which IDs are actually allowed (a variable name? a namespace? etc.). So what you need is essentially 3 things:
A symbol table that provides you with all possible names sorted by scope. Creating this depends heavily on the parsed language. But this is a task where a parser is very helpful. You may want to cache this info as it is time consuming to run this analysis step.
Determine in which scope you are when invoking auto completion. You could use a parser as well here (maybe in conjunction with step 1).
Determine what type of symbol(s) you want to show. Many people think this is where a parser can give you all necessary information (the follow set). But as mentioned above that's not true (keywords aside).
In my blog post Universal Code Completion using ANTLR3 I addressed the 3rd step in particular. There I don't use a parser but simulate one, except that I don't stop where a parser would; I stop when the caret position is reached (so the input must be valid syntax up to that point). After reaching the caret, the collection process starts, which not only collects terminal nodes (for keywords) but also looks at rule names to learn what needs to be collected. Using specific rule names is my way of putting context into the grammar: when the collection code finds a rule table_ref, it knows it doesn't need to go further down the rule chain (to the ultimate ID token), but can instead use this information to offer a list of tables as suggestions.
With ANTLR4 things might become even simpler. I haven't used it myself yet, but the parser interpreter could be a big help here, as it essentially does what I do manually in my implementation (with the ANTLR3 backend).
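For step 1, a scope-aware symbol table can be as simple as a dictionary per scope with a link to its parent scope. The sketch below is purely illustrative (the SymbolKind values and class names are made up for this example):

using System.Collections.Generic;

enum SymbolKind { Table, Column, Trigger, Variable }

sealed class Scope
{
    readonly Dictionary<string, SymbolKind> symbols = new Dictionary<string, SymbolKind>();

    public Scope Parent { get; }

    public Scope(Scope parent = null) => Parent = parent;

    public void Define(string name, SymbolKind kind) => symbols[name] = kind;

    // All candidate names of a given kind visible from this scope, walking
    // outward through the enclosing scopes.
    public IEnumerable<string> Visible(SymbolKind kind)
    {
        for (var scope = this; scope != null; scope = scope.Parent)
            foreach (var entry in scope.symbols)
                if (entry.Value == kind)
                    yield return entry.Key;
    }
}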
This is probably pretty hard to do.
Fundamentally you want to use some parser to predict "what comes next" to display as auto-completion. This has to at least predict what the FIRST token is at the point where the user's input stops.
For ANTLR, I think this will be very difficult. The reason is that ANTLR generates essentially procedural, recursive descent parsers. So at runtime, when you need to figure out what FIRST tokens are, you have to inspect the procedural source code of the generated parser. That way lies madness.
This blog entry claims to achieve autocompletion by collecting error reports rather than inspecting the parser code. It's sort of an interesting idea, but I do not understand how his method really works, and I cannot see how it would offer all possible FIRST tokens; it might acquire some of them. This SO answer confirms my intuition.
Sam Harwell discusses how he has tackled this; he is one of the ANTLR4 implementers and if anybody can make this work, he can. It wouldn't surprise me if he reached inside ANTLR to extract the information he needs; as an ANTLR implementer he would certainly know where to tap in. You are not likely to be so well positioned. Even so, he doesn't really describe what he did in detail. Good luck replicating. You might ask him what he really did.
What you want is a parsing engine for which that FIRST token information is either directly available (the parser generator could produce it) or computable based on the parser state. This is actually possible to do with bottom up parsers such as LALR(k); you can build an algorithm that walks the state tables and computes this information. (We do this with our DMS Software Reengineering Toolkit for its GLR parser precisely to produce syntax error reports that say "missing token, could be any of these [set]")
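For what it's worth, computing FIRST sets from a grammar you control is a standard fixed-point algorithm (see the Dragon Book). Here is a rough sketch over an ad-hoc grammar representation, just to show the shape of the computation; it is not tied to ANTLR or any particular parser generator.

using System.Collections.Generic;
using System.Linq;

static class FirstSets
{
    // productions: each entry maps a non-terminal (Lhs) to one right-hand
    // side; terminals are whatever never appears as an Lhs; "" is epsilon.
    public static Dictionary<string, HashSet<string>> Compute(
        List<(string Lhs, string[] Rhs)> productions)
    {
        var nonTerminals = new HashSet<string>(productions.Select(p => p.Lhs));
        var first = nonTerminals.ToDictionary(nt => nt, _ => new HashSet<string>());

        bool changed = true;
        while (changed)                        // iterate until a fixed point
        {
            changed = false;
            foreach (var (lhs, rhs) in productions)
            {
                bool allNullable = true;
                foreach (var symbol in rhs)
                {
                    if (symbol == "") continue;            // explicit epsilon
                    if (!nonTerminals.Contains(symbol))
                    {
                        // Terminal: it starts this production; add it and stop.
                        changed |= first[lhs].Add(symbol);
                        allNullable = false;
                        break;
                    }
                    // Non-terminal: add its FIRST set minus epsilon.
                    foreach (var t in first[symbol].Where(t => t != ""))
                        changed |= first[lhs].Add(t);
                    if (!first[symbol].Contains("")) { allNullable = false; break; }
                }
                if (allNullable)
                    changed |= first[lhs].Add("");         // whole RHS can vanish
            }
        }
        return first;
    }
}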

calculating fft with complex number in c#

I use this formula to get the frequency of a signal, but I don't understand how to implement it in code with complex numbers. There is an "i" in the formula, which corresponds to Math.Sqrt(-1). How can I apply this formula to a signal in C# with the NAudio library?
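The "i" in the formula maps directly to System.Numerics.Complex in C#. Below is a naive DFT sketch that shows where the complex exponential goes; it is deliberately slow and not a real FFT (NAudio ships one in NAudio.Dsp.FastFourierTransform), and reading the samples out of NAudio is omitted here.

using System;
using System.Numerics;

static class Dft
{
    // X[k] = sum over n of x[n] * e^(-i * 2*pi * k * n / N)
    public static Complex[] Transform(float[] samples)
    {
        int n = samples.Length;
        var result = new Complex[n];
        for (int k = 0; k < n; k++)
        {
            Complex sum = Complex.Zero;
            for (int t = 0; t < n; t++)
            {
                double angle = -2.0 * Math.PI * k * t / n;
                // This is the e^(-i...) term: "i" lives inside Complex.
                sum += samples[t] * Complex.Exp(new Complex(0, angle));
            }
            result[k] = sum;
        }
        return result;
    }

    // The bin with the largest magnitude (ignoring DC) gives the dominant
    // frequency: frequency = binIndex * sampleRate / N.
    public static double DominantFrequency(float[] samples, int sampleRate)
    {
        var spectrum = Transform(samples);
        int best = 1;
        for (int k = 2; k < spectrum.Length / 2; k++)
            if (spectrum[k].Magnitude > spectrum[best].Magnitude)
                best = k;
        return (double)best * sampleRate / samples.Length;
    }
}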
If you want to go back to a basic level then:
You'll want to use some form of probabilistic model, something like a hidden Markov model (HMM). This will allow you to test what the user says against a collection of models, one for each word they are allowed to say.
Additionally you want to transform the audio waveform into something that your program can more easily interpret. Something like a fast Fourier transform (FFT) or a wavelet transform (CWT).
The steps would be:
Get audio
Remove background noise
Transform via FFT or CWT
Detect peaks and other features of the audio
Compare these features with your HMMs
Pick the HMM with the best result above a threshold.
Of course this requires you to previously train the HMMs with the correct words.
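To give a very rough idea of steps 4-6 only, here is a sketch that picks spectral peaks as crude features, scores each candidate word model, and accepts the best one only if it clears a threshold. The IWordModel interface is a placeholder; a real system would use HMM likelihoods rather than this stub.

using System.Collections.Generic;
using System.Linq;

// Placeholder for a trained per-word model; a real system would return an
// HMM log-likelihood from Score.
interface IWordModel
{
    string Word { get; }
    double Score(IReadOnlyList<int> peakBins);
}

static class WordPicker
{
    // Step 4: indices of local maxima in the magnitude spectrum.
    public static List<int> FindPeaks(double[] magnitudes)
    {
        var peaks = new List<int>();
        for (int i = 1; i < magnitudes.Length - 1; i++)
            if (magnitudes[i] > magnitudes[i - 1] && magnitudes[i] > magnitudes[i + 1])
                peaks.Add(i);
        return peaks;
    }

    // Steps 5-6: score every model, keep the best one if it clears the threshold.
    public static string Recognize(double[] magnitudes,
                                   IEnumerable<IWordModel> models,
                                   double threshold)
    {
        var peaks = FindPeaks(magnitudes);
        var best = models
            .Select(m => new { m.Word, Value = m.Score(peaks) })
            .OrderByDescending(x => x.Value)
            .First();
        return best.Value >= threshold ? best.Word : null;
    }
}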
A lot of languages actually provide libraries for this built in. One example, in C#/.NET, is at this link. This gives you a step-by-step guide to setting up a speech recognition program. It also abstracts you away from the low-level detail of parsing audio for certain phonemes etc. (which frankly is pointless given the number of libraries out there, unless you wish to write a highly optimized version).
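I can't vouch for that exact tutorial, but the built-in option it refers to is System.Speech.Recognition (Windows only, reference System.Speech.dll). A minimal fixed-vocabulary recognizer looks roughly like this:

using System;
using System.Speech.Recognition;

class CommandListener
{
    static void Main()
    {
        using var engine = new SpeechRecognitionEngine();

        // Restrict recognition to a handful of command words.
        var words = new Choices("left", "right", "up", "down", "stop");
        engine.LoadGrammar(new Grammar(new GrammarBuilder(words)));

        engine.SpeechRecognized += (s, e) =>
            Console.WriteLine($"{e.Result.Text} ({e.Result.Confidence:0.00})");

        engine.SetInputToDefaultAudioDevice();
        engine.RecognizeAsync(RecognizeMode.Multiple);   // keep listening

        Console.ReadLine();
    }
}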
It is a difficult problem nonetheless, and you will have to use an ASR framework to do it. I have done something slightly more complex (~100 words) using Sphinx4. You can also use HTK.
In general what you have to do is:
write down all the words that you want to recognize
determine the syntax of your commands like (direction) (amount)
Then choose a framework, get an acoustic model, generate a dictionary and a language model compatible with that framework. Then integrate the framework into your application.
I hope I have mentioned all important things you need to do. You can google them separately or go to your chosen framework's tutorial.
Your task is relatively simple in terms of speech recognition and you should get good results if you complete it.

person voice identification/recognition

I want to record someone's voice and then, from the information I get about that voice, recognize when that person speaks again. The problem is I have no information about which stats (like frequency) make one human voice different from another. Could anyone help me with how I could recognize someone's voice?
While researching I found various speech recognition libraries, but they could not help me because my problem is much simpler: I just want to recognize the person who is speaking, not what they are saying.
The problem you describe is not simple since the voice of the same person can sound different (for example if the person has a cold etc.) and/or if the person is speaking louder/faster/slower etc.
Another point is the separation from other sounds (background, other voices etc.).
The quality of the equipment which records the sound is very important - some systems use multiple microphones to achieve good results...
Altogether this is no easy task - esp. if you want to achieve a good detection ratio.
Basically the way to implement this is:
implement robust sound separation
implement a robust sound/voice pattern extraction
create a DB with fingerprint(s) of the voice(s) you want to recognize based on ideal sound setting
define an algorithm to compare your stored fingerprint(s) with the extracted/normalized sound/voice pattern (thresholds for "probably equal" etc. might be necessary...)
refine your algorithms till you achieve an acceptable detection rate (take the false positive rate into account too!)
For a nice overview see http://www.scholarpedia.org/article/Speaker_recognition
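Just to illustrate the comparison step (step 4 in the list above): assuming both fingerprints are already fixed-length feature vectors (extracting them is the hard part and not shown here), a simple cosine-similarity check with a tunable threshold could look like this:

using System;

static class VoiceMatcher
{
    // Cosine similarity: 1.0 = identical direction, 0.0 = unrelated.
    public static double Similarity(double[] stored, double[] extracted)
    {
        if (stored.Length != extracted.Length)
            throw new ArgumentException("Fingerprints must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < stored.Length; i++)
        {
            dot += stored[i] * extracted[i];
            normA += stored[i] * stored[i];
            normB += extracted[i] * extracted[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB) + 1e-12);
    }

    // "Probably equal" decision with a tunable threshold.
    public static bool IsSameSpeaker(double[] stored, double[] extracted,
                                     double threshold = 0.85)
        => Similarity(stored, extracted) >= threshold;
}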
See VoiceID for Linux. It uses Sphinx and other libs and installs pretty easily.
Some help here, maybe: http://www.generation5.org/content/2004/noReco.asp
Based on an open source FFT library ( http://www.exocortex.org/dsp/ ), with some suggestions about how to do speaker verification.

Intercepting TeamSpeak output in C#

First I'd like to say this is my first question here; I'll try to follow the asking tips as best I can... Also, I couldn't find a way to post this in a specific C# section, so I just put it in the title.
What I'm trying to accomplish is intercepting the sound output of TeamSpeak, and figuring out which person on the channel is producing the loudest sound. I had a look at TeamSpeak SDK but it's more intended for building your own VoIP software than fooling around with TeamSpeak itself...
At first I'm just going to make a simple program that shows the names of the persons and a dB bar (or something that represents loudness) next to them.
I was surprised to see there isn't much discussion around this; I think there are a lot of cool snippets to be made (this one will be for swiftly kicking the loudest offenders).
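For the dB bar part at least: however you end up capturing the per-speaker audio (which is the harder problem and not covered here), the loudness display usually boils down to computing the RMS of a block of samples and converting it to decibels. A minimal sketch, assuming samples normalized to [-1, 1]:

using System;

static class LoudnessMeter
{
    // Given a block of audio samples in [-1, 1], compute RMS and convert it
    // to dBFS for the meter: 0 dB is full scale, quieter signals go negative.
    public static double RmsToDb(float[] samples)
    {
        double sumSquares = 0;
        foreach (var s in samples)
            sumSquares += s * s;

        double rms = Math.Sqrt(sumSquares / samples.Length);
        return 20.0 * Math.Log10(Math.Max(rms, 1e-9));
    }
}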

C++ scanner (string-fu!)

I'm writing a scanner as part of a compiler.
I'm having a major headache trying to write this one portion:
I need to be able to parse a stream of tokens and push them one by one into a vector, ignoring whitespace and tokenizing special symbols (simple case: let's just consider parentheses and braces)
Example:
int main(){ }
should parse into 6 different tokens:
int
main
(
)
{
}
How would you go about solving this? I'm writing this in C++, but a Java/C# solution would be appreciated as well.
Some points:
And no, I can't use Boost; I can't guarantee that the libraries will be available to me (don't ask...).
I don't want to use lex or any other special tools. I've never done this before and just want to try it once to say I've done it.
Stroustrup's book, The C++ Programming Language, has a great example in it about building a lexer/parser for a simple calculator program. It should serve as a good starting point to learn how to do what you want.
Buy a copy of Compilers: Principles, Techniques, and Tools (the Dragon Book). What you're attempting to write is a lexer, not a "scanner".
Why write your own - look at Lex.
If you must have your own, you just read the input character by character and maintain some minimum state to accumulate identifiers.
The problem itself is not hard. If you can't solve it, you must be burned out, you just need a rest. Look at it again in the morning.
If you really want to learn something from this exercise, just start coding. It doesn't demand a lot of code, so you can fail repeatedly without blowing more than an afternoon.
At this point you'll have a good feel for the problem.
Then look in any random compilers book to see what the "usual" methods are, and you'll grok them immediately.
Umm... I'd just do a while loop with iterators, testing each character's type; on a change from alpha to non-alpha, dump the accumulated string if it's non-empty. If the character is non-alpha and non-whitespace, push it onto the token stack on its own. This is really a trivial parsing task. Shoot, I've been meaning to learn lex/yacc, but the level of parsing you want is really easy; I wrote an HTML tokenizer once that was more complicated than this. You're just looking for names, whitespace and single non-alphanumeric characters. Just do it.
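Since the question says a C# solution is welcome, here is a sketch of exactly that loop: accumulate runs of identifier characters, skip whitespace, and emit any other character as a single-character token.

using System.Collections.Generic;
using System.Text;

static class Scanner
{
    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        void Flush()
        {
            if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }

        foreach (char c in input)
        {
            if (char.IsLetterOrDigit(c) || c == '_')
            {
                current.Append(c);            // part of an identifier/keyword
            }
            else
            {
                Flush();                      // alpha -> non-alpha boundary
                if (!char.IsWhiteSpace(c))
                    tokens.Add(c.ToString()); // '(', ')', '{', '}', ...
            }
        }
        Flush();
        return tokens;
    }
}

Tokenize("int main(){ }") yields the six tokens from the example: int, main, (, ), {, }.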
If you want to write this from scratch, you could look into writing a finite state machine (states in an enum, a big switch/case block for state switching). You'd have to push the state to a stack since everything can be nested.
I know that this is not the ideal method; I'm just trying to directly address the question.
