person voice identification/recognition - c#

I want to record someone's voice and then, from the information I get about their voice, recognize when that person speaks again. The problem is that I have no information about which characteristics (like frequency) distinguish one human voice from another. Could anyone help me with how I could recognize someone's voice?
While I was researching I found various libraries for speech recognition, but they could not help me because my problem is much simpler: I just want to recognize the person who is speaking, not what they are saying.

The problem you describe is not simple, since the same person's voice can sound different (for example if they have a cold) and/or if they are speaking louder, faster, or slower than usual.
Another issue is separating the voice from other sounds (background noise, other voices, etc.).
The quality of the recording equipment is also very important; some systems use multiple microphones to achieve good results.
Altogether this is no easy task, especially if you want to achieve a good detection rate.
Basically the way to implement this is:
implement robust sound separation
implement a robust sound/voice pattern extraction
create a DB with fingerprint(s) of the voice(s) you want to recognize, based on an ideal recording setting
define an algorithm for comparison between your stored fingerprint(s) and the extracted/normalized sound/voice pattern (having some thresholds for "probably equal" etc. might be necessary)
refine your algorithms till you achieve an acceptable detection rate (take the false positive rate into account too!)
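The comparison step in the list above can be sketched very simply. In this sketch, voice "fingerprints" are assumed to be fixed-length feature vectors (e.g. averaged spectral bands or MFCCs) produced by an earlier extraction stage; the class name and the 0.95 threshold are illustrative, not from any library:

```csharp
using System;

static class VoiceMatcher
{
    // Cosine similarity: 1.0 means identical direction, near 0 means unrelated.
    public static double CosineSimilarity(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Fingerprints must have equal length.");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // "Probably equal" if the similarity exceeds a tuned threshold.
    public static bool ProbablySameSpeaker(double[] stored, double[] candidate,
                                           double threshold = 0.95)
        => CosineSimilarity(stored, candidate) >= threshold;
}
```

In practice you would tune the threshold against labeled recordings, trading the detection rate against the false positive rate mentioned above.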
For a nice overview see http://www.scholarpedia.org/article/Speaker_recognition

See VoiceID for Linux. It uses Sphinx and other libs and installs pretty easily.

Some help here, maybe: http://www.generation5.org/content/2004/noReco.asp
It is based on an open-source FFT library (http://www.exocortex.org/dsp/) and includes some suggestions about how to do speaker verification.

Related

Limiting length of each "line" in Azure's Speech Translation

I am using this code example from Azure's speech translation (in C#) to build a multi-language subtitler for Zoom calls.
https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/quickstart/csharp/dotnet/translate-speech-to-text/helloworld/Program.cs
It's been ages since I've done any coding so I'm trying to get back into it, but what I can't work out is whether there is a way to change how the speech recognizer splits lines. At the moment, it waits until there are a couple of seconds of silence before finalising a result. I would like it to do that, but also insert a line break after a set time, say five seconds or so, if the person is speaking for longer.
Is that possible, does anyone know?
Very sorry if this is a stupid question, I promise I have looked for myself but can't find the right words.
Thanks for asking this. This may not be the ultimate answer to everything you need, but hopefully it will help.
There is a service property you can set that will cause the intermediate results, delivered through the Recognizing event, to be more "complete" and not get replaced as the recognition continues.
Here is a reference to the properties available: https://learn.microsoft.com/en-us/javascript/api/microsoft-cognitiveservices-speech-sdk/queryparameternames?view=azure-node-latest
You set the property on the config object like so:
speechTranslationConfig.SetServiceProperty("stableIntermediateThreshold", "3", ServicePropertyChannel.UriQueryParameter);
There is also a property on the translation service that you can set to make it stable as well
speechTranslationConfig.SetServiceProperty("stableTranslation", "true", ServicePropertyChannel.UriQueryParameter);
You might have to play around with the values to determine the right threshold and some languages may have issues if the order of their sentences can change drastically before the end. Japanese is a good example of a language where you might only want to use the final recognition.
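Putting the two properties together, here is a minimal configuration sketch using the Azure Speech SDK (Microsoft.CognitiveServices.Speech). The subscription key, region, and languages are placeholders; the property names and values come from the answer above and may need tuning for your scenario:

```csharp
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Translation;

var config = SpeechTranslationConfig.FromSubscription("<your-key>", "<your-region>");
config.SpeechRecognitionLanguage = "en-US";
config.AddTargetLanguage("ja");

// Make intermediate results more stable so partial subtitles change less often.
config.SetServiceProperty("stableIntermediateThreshold", "3",
                          ServicePropertyChannel.UriQueryParameter);
config.SetServiceProperty("stableTranslation", "true",
                          ServicePropertyChannel.UriQueryParameter);
```

This is a configuration fragment only; you would pass `config` to a `TranslationRecognizer` as in the linked quickstart sample.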

calculating fft with complex number in c#

I use this formula to get the frequency of a signal, but I don't understand how to implement the code with complex numbers. There is an "i" in the formula, which relates to Math.Sqrt(-1). How can I apply this formula to a signal in C# with the NAudio library?
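Note that C# has a built-in complex type, System.Numerics.Complex, so you never need Math.Sqrt(-1) directly. Below is a sketch (using a synthetic signal rather than NAudio) of a naive DFT that finds the dominant frequency bin; the class and method names are made up for illustration:

```csharp
using System;
using System.Numerics;

static class Dft
{
    // Naive discrete Fourier transform: X[k] = sum over t of x[t] * e^(-2*pi*i*k*t/N).
    // O(N^2), fine for illustration; use a real FFT library for production work.
    public static Complex[] Transform(double[] signal)
    {
        int n = signal.Length;
        var result = new Complex[n];
        for (int k = 0; k < n; k++)
        {
            Complex sum = Complex.Zero;
            for (int t = 0; t < n; t++)
                sum += signal[t] * Complex.Exp(new Complex(0, -2 * Math.PI * k * t / n));
            result[k] = sum;
        }
        return result;
    }

    // Index of the strongest bin in the first half of the spectrum;
    // the frequency in Hz is then bin * sampleRate / N.
    public static int DominantBin(Complex[] spectrum)
    {
        int best = 0;
        for (int k = 1; k < spectrum.Length / 2; k++)
            if (spectrum[k].Magnitude > spectrum[best].Magnitude) best = k;
        return best;
    }
}
```

With NAudio you would fill the `signal` array from the decoded samples instead of generating them.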
If you want to go back to a basic level then:
You'll want to use some form of probabilistic model, something like a hidden Markov model (HMM). This will allow you to test what the user says to a collection of models, one for each word they are allowed to say.
Additionally you want to transform the audio waveform into something that your program can more easily interpret, using something like a fast Fourier transform (FFT) or a continuous wavelet transform (CWT).
The steps would be:
Get audio
Remove background noise
Transform via FFT or CWT
Detect peaks and other features of the audio
Compare these features with your HMMs
Pick the HMM with the best result above a threshold.
Of course this requires you to previously train the HMMs with the correct words.
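As a tiny illustration of the "detect peaks" step above, here is a sketch that finds local maxima in an FFT magnitude spectrum. The minimum-height threshold and the helper name are made up for the example; a real system would also smooth the spectrum first:

```csharp
using System;
using System.Collections.Generic;

static class PeakDetector
{
    // Returns indices of local maxima in a magnitude spectrum that
    // exceed a minimum height.
    public static List<int> FindPeaks(double[] magnitudes, double minHeight)
    {
        var peaks = new List<int>();
        for (int i = 1; i < magnitudes.Length - 1; i++)
        {
            if (magnitudes[i] > minHeight &&
                magnitudes[i] > magnitudes[i - 1] &&
                magnitudes[i] > magnitudes[i + 1])
            {
                peaks.Add(i);
            }
        }
        return peaks;
    }
}
```

The peak indices (converted to frequencies) would then become the feature vector you feed to the HMM comparison step.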
A lot of languages actually provide libraries for this that come built in. One example, in C#.NET, is at this link. This gives you a step-by-step guide to setting up a speech recognition program. It also abstracts away the low-level detail of parsing audio for certain phonemes etc. (which frankly is pointless given the number of libraries available, unless you wish to write a highly optimized version).
It is a difficult problem nonetheless and you will have to use an ASR framework to do it. I have done something slightly more complex (~100 words) using Sphinx4. You can also use HTK.
In general what you have to do is:
write down all the words that you want to recognize
determine the syntax of your commands like (direction) (amount)
Then choose a framework, get an acoustic model, generate a dictionary and a language model compatible with that framework. Then integrate the framework into your application.
I hope I have mentioned all important things you need to do. You can google them separately or go to your chosen framework's tutorial.
Your task is relatively simple in terms of speech recognition and you should get good results if you complete it.

Analyzing audio to create Guitar Hero levels automatically

I'm trying to create a Guitar-Hero-like game (something like this) and I want to be able to analyze an audio file given by the user and create levels automatically, but I am not sure how to do that.
I thought maybe I should use BPM detection algorithm and place an arrow on a beat and a rail on some recurrent pattern, but I have no idea how to implement those.
Also, I'm using NAudio's BlockAlignReductionStream, which has a Read method that copies byte[] data, but what happens when I read a 2-channel audio file? Does it read 1 byte from the first channel and 1 byte from the second (because it says 16-bit PCM)? And does the same happen with 24-bit and 32-bit float?
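On the interleaving question: in 16-bit stereo PCM the channels are interleaved per sample frame, 2 bytes for the left channel followed by 2 bytes for the right (24-bit and 32-bit float formats are interleaved the same way, just with 3 or 4 bytes per sample). A sketch of splitting such a buffer, independent of NAudio (the class name is made up):

```csharp
using System;

static class Pcm
{
    // Splits an interleaved 16-bit stereo PCM buffer (L0 R0 L1 R1 ...)
    // into separate left/right sample arrays.
    public static (short[] left, short[] right) Deinterleave16BitStereo(byte[] buffer)
    {
        int frames = buffer.Length / 4; // 2 bytes * 2 channels per frame
        var left = new short[frames];
        var right = new short[frames];
        for (int i = 0; i < frames; i++)
        {
            left[i] = BitConverter.ToInt16(buffer, i * 4);
            right[i] = BitConverter.ToInt16(buffer, i * 4 + 2);
        }
        return (left, right);
    }
}
```

For beat detection you would typically average the two channels into a single mono signal before analysis.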
Beat detection (or more specifically BPM detection)
Beat detection algorithm overview for using a comb filter:
http://www.clear.rice.edu/elec301/Projects01/beat_sync/beatalgo.html
Looks like they do:
A fast Fourier transform
Hanning Window, full-wave rectification
Multiple low pass filters; one for each range of the FFT output
Differentiation and half-wave rectification
Comb filter
Lots of algorithms you'll have to implement here. Comb filters are supposedly slow, though. The wiki article didn't point me at other specific methods.
Edit: This article has information on streaming statistical methods of beat detection. That sounds like a great idea: http://www.flipcode.com/misc/BeatDetectionAlgorithms.pdf - I'm betting they run better in real time, though are less accurate.
BTW I just skimmed and pulled out keywords. I've only toyed with FFT, rectification, and attenuation filters (low-pass filter). The rest I have no clue about, but you've got links.
This will all get you the BPM of the song, but it won't generate your arrows for you.
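The statistical approach in the flipcode article boils down to: a window counts as a beat when its short-term energy is well above the average energy of the recent history. Here is a sketch of that idea; the window size, history length, and 1.3 sensitivity factor are illustrative defaults, not values from the article:

```csharp
using System;
using System.Collections.Generic;

static class BeatDetector
{
    // Flags a window as a beat when its energy exceeds the average energy
    // of the trailing history by a sensitivity factor. Returns the sample
    // offsets where beats were detected.
    public static List<int> DetectBeats(double[] samples, int windowSize = 1024,
                                        int historyWindows = 43, double sensitivity = 1.3)
    {
        var beats = new List<int>();
        var history = new Queue<double>();
        for (int start = 0; start + windowSize <= samples.Length; start += windowSize)
        {
            double energy = 0;
            for (int i = 0; i < windowSize; i++)
                energy += samples[start + i] * samples[start + i];

            if (history.Count == historyWindows)
            {
                double avg = 0;
                foreach (var e in history) avg += e;
                avg /= history.Count;
                if (energy > sensitivity * avg) beats.Add(start);
                history.Dequeue();
            }
            history.Enqueue(energy);
        }
        return beats;
    }
}
```

The intervals between detected beats then give you a BPM estimate (e.g. the median interval converted to beats per minute).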
Level generation
As for "place an arrow on a beat and a rail on some recurrent pattern", that is going to be a bit trickier to implement to get good results.
You could go with a more aggressive content extraction approach, and try to pull the notes out of the song.
You'd need to use beat detection for this part too. This may be similar to BPM detection above, but at a different range, with a band-pass filter for the instrument range. You also would swap out or remove some parts of the algorithm, and would have to sample the whole song since you're not detecting a global BPM. You'd also need some sort of pitch detection.
I think this approach will be messy and will guarantee you need to hand-scrub the results for every song. If you're okay with this, and just want to avoid the initial hand transcription work, this will probably work well.
You could also try to go with a content generation approach.
Most procedural content generation has been done in a trial-and-error manner, with people publishing or patenting algorithms that don't completely suck. Often there is no real qualitative analysis that can be done on content generation algorithms because they generate aesthetics. So you'd just have to pick ones that seem to give pleasing sample results and try it out.
Most algorithms are centered around visual content generation, including terrain, architecture, humanoids, plants etc. There is some research on audio content generation, Generative Music, etc. Your requirements don't perfectly match either of these.
I think algorithms for procedural "dance steps" (if such a thing exists - I only found animation techniques) or Generative Music would be the closest match, if driven by the rhythms you detect in the song.
If you want to go down the composition generation approach, be prepared for a lot of completely different algorithms that are usually just hinted about, but not explained in detail.
E.g.:
http://tones.wolfram.com/about/faqs/howitworks.html
http://research.microsoft.com/en-us/um/redmond/projects/songsmith/

Intercepting TeamSpeak output in C#

First I'd want to say this is my first question here; I'll try to comply with the asking tips the best I can. Also, I couldn't find a way to post this in a specific C# section, so I just used it in the title.
What I'm trying to accomplish is intercepting the sound output of TeamSpeak and figuring out which person on the channel is producing the loudest sound. I had a look at the TeamSpeak SDK, but it's more intended for building your own VoIP software than for fooling around with TeamSpeak itself...
At first I'm just going to make a simple program that shows the names of the persons and a dB bar (or something that represents loudness) next to them.
I was surprised to see there isn't much discussion around this; I think there are a lot of cool snippets to be made (this one will be for swiftly kicking the loudest offenders).

Sentence generator using Thesaurus

I am creating an application in .NET.
I found a working application, http://www.spinnerchief.com/. It does what I need it to do, but I did not get any help from Google on how to build something similar.
I need functional results for my application, where users can give one sentence and then the user can get the same sentence, but have it worded differently.
Here is an example of what I want.
Suppose I enter the sentence "Pankaj is a good man." The output should be similar to the following:
Pankaj is a great person.
Pankaj is a superb man.
Pankaj is a acceptable guy.
Pankaj is a wonderful dude.
Pankaj is a superb male.
Pankaj is a good human.
Pankaj is a splendid gentleman
To do this correctly for any arbitrary sentence you would need to perform natural language analysis of the source sentence. You may want to look into the SharpNLP library - it's a free library of natural language processing tools for C#/.NET.
If you're looking for a simpler approach, you have to be willing to sacrifice correctness to some degree. For instance, you could create a dictionary of trigger words, which - when they appear in a sentence - are replaced with synonyms from a thesaurus. The problem with this approach is making sure that you replace a word with an equivalent part of speech. In English, it's possible for certain words to be different parts of speech (verb, adjective, adverb, etc) based on their contextual usage in a sentence.
An additional consideration you'll need to address (if you're not using an NLP library) is stemming. In most languages, certain parts of speech are conjugated/modified (verbs in English) based on the subject they apply to (or the object, speaker, or tense of the sentence).
If all you want to do is replace adjectives (as in your example), the approach of using trigger words may work, but it won't be readily extensible. Before you do anything, I would suggest that you clearly define the requirements and rules for your problem domain... and use that to decide which route to take.
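The trigger-word idea above can be sketched in a few lines. The synonym table here is made up for the example and, as noted, this does nothing about part-of-speech disambiguation or stemming:

```csharp
using System;
using System.Collections.Generic;

static class Spinner
{
    // Maps trigger words to candidate replacements. Purely illustrative.
    static readonly Dictionary<string, string[]> Synonyms = new()
    {
        ["good"] = new[] { "great", "superb", "wonderful" },
        ["man"]  = new[] { "person", "guy", "gentleman" },
    };

    // Replaces each trigger word with its i-th synonym (wrapping around),
    // so different values of i yield different variants of the sentence.
    public static string Spin(string sentence, int i)
    {
        var words = sentence.Split(' ');
        for (int w = 0; w < words.Length; w++)
        {
            string core = words[w].TrimEnd('.', ',', '!', '?');
            if (Synonyms.TryGetValue(core.ToLowerInvariant(), out var subs))
            {
                string suffix = words[w].Substring(core.Length);
                words[w] = subs[i % subs.Length] + suffix;
            }
        }
        return string.Join(" ", words);
    }
}
```

Calling `Spinner.Spin("Pankaj is a good man.", 0)` yields "Pankaj is a great person."; incrementing the index cycles through the other variants. Note how "a acceptable" style agreement errors in your sample output are exactly what this naive approach produces.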
For this, the best thing for you to use is WordNet and its hyponym/hypernym relations. There is a WordNet .Net library. For each word you want to alternate, you can either get its hypernym (i.e. for person, a hypernym means "person is a kind of...") or a hyponym ("X is a kind of person"). Then just replace the word you are alternating.
You will want to make sure you have the correct part-of-speech (i.e. noun, adjective, verb...) and there is also the issue of senses, which may introduce some undesired alternations (sense #1 is the most common).
I don't know anything about .Net, but you should look into using a dictionary function (I'm sure there is one, or at least a library that streamlines the process if there isn't).
Then, you'd have to go through the string and omit words like "is" or "a", only taking words you want to have synonyms for.
After this, it's pretty simple to have a loop spit out your sentences.
Good luck.
