Speech to text in C#

I have a C# program that listens to my microphone; when I speak, it runs commands and talks back. For example, when I say "What's the weather tomorrow?" it replies with tomorrow's weather.
The only problem is that I have to type out every phrase I want to say and have it pre-recorded. So if I want to ask for the weather, I HAVE to say it exactly as I coded it, with no variations. Is there code that can change this?
I want to be able to say "What's the weather for tomorrow", "what's tomorrow's weather" or "can you tell me tomorrow's weather" and have it tell me the next day's weather, without having to type each phrase into the code. I have seen something about e.Result.Alternates; is that what I need to use?

This cannot be done without involving linguistic resources. Let me explain what I mean by this.
As you may have noticed, your C# program only recognizes pre-recorded phrases, and only if you say the exact same words. (As an aside, this is quite an achievement in itself, because you can hardly say a sentence twice without altering it a bit. Small changes, e.g. in sound frequency or length, might not matter to your colleagues, but they matter to your program.)
Therefore, you need to incorporate some kind of linguistic resource in your program. In other words, make it "understand" facts about human language. Two suggestions of increasing complexity follow. Both approaches assume that your tool is capable of tokenizing an audio input stream in a sensible way, i.e. of extracting words from it.
Pattern matching
To avoid hard-coding the sentences like
Tell me about the weather.
What's the weather tomorrow?
Weather report!
you can instead define a pattern that matches any of those sentences:
if a sentence contains "weather", then output a weather report
This can be refined further in many ways, e.g.:
if a sentence contains "weather" and "tomorrow", output tomorrow's forecast.
if a sentence contains "weather" and "Bristol", output a forecast for Bristol
This kind of knowledge must be put into your program explicitly, for instance in the form of a dictionary or lookup table.
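A rough sketch of such a lookup table, assuming the recognizer hands you the sentence as plain text. The rule list, the intent names and the MatchIntent helper are made up for illustration; they are not part of System.Speech. It is written with C# 9 top-level statements, and the prefix test is a crude way to let "tomorrows" match "tomorrow".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rule table: required keywords -> intent name.
// More specific rules come first so that they win over the generic one.
var rules = new List<(string[] Keywords, string Intent)>
{
    (new[] { "weather", "tomorrow" }, "forecast-tomorrow"),
    (new[] { "weather", "bristol" },  "forecast-bristol"),
    (new[] { "weather" },             "forecast-today"),
};

// Returns the first intent whose keywords all occur in the sentence, else "unknown".
string MatchIntent(string sentence)
{
    string[] words = sentence.ToLowerInvariant()
        .Split(new[] { ' ', '.', ',', '?', '!', '\'' }, StringSplitOptions.RemoveEmptyEntries);

    foreach (var rule in rules)
    {
        // A keyword counts as present if any word starts with it ("tomorrows" ~ "tomorrow").
        bool allPresent = rule.Keywords.All(
            k => words.Any(w => w.StartsWith(k, StringComparison.Ordinal)));
        if (allPresent) return rule.Intent;
    }
    return "unknown";
}

Console.WriteLine(MatchIntent("can you tell me tomorrows weather")); // prints forecast-tomorrow
```

The prefix test is deliberately sloppy (it would also accept "weatherman"), so treat it as a starting point rather than a finished matcher.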
Measuring Similarity
If you plan to spend more time on this, you could implement a means for finding the similarity between input sentences. There are many approaches to this as well, but a prominent one is a bag of words, represented as a vector.
In this model, each sentence is represented as a vector, with each word acting as one dimension of that vector. For example, the sentence "I hate green apples" could be represented as
I = 1
hate = 1
green = 1
apples = 1
red = 0
you = 0
Note that the words that do not occur in this particular sentence, but in other phrases the program is likely to encounter, also represent dimensions (for example the red = 0).
The big advantage of this approach is that the similarity of vectors can be easily computed, no matter how multi-dimensional they are. There are several techniques that estimate similarity, one of them is cosine similarity (see for example http://en.wikipedia.org/wiki/Cosine_similarity).
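To make the idea concrete, here is a minimal sketch of bag-of-words vectors and cosine similarity in C# (top-level statements). The function names are my own, and the tokenizer is a naive split on spaces and punctuation; a dictionary stands in for the vector, with missing words counting as 0.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Turns a sentence into a word -> count map (the "bag of words").
Dictionary<string, int> BagOfWords(string sentence)
{
    return sentence.ToLowerInvariant()
        .Split(new[] { ' ', '.', ',', '?', '!' }, StringSplitOptions.RemoveEmptyEntries)
        .GroupBy(w => w)
        .ToDictionary(g => g.Key, g => g.Count());
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Words missing from a bag count as 0.
double CosineSimilarity(Dictionary<string, int> a, Dictionary<string, int> b)
{
    double dot = a.Where(kv => b.ContainsKey(kv.Key))
                  .Sum(kv => (double)kv.Value * b[kv.Key]);
    double normA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
    double normB = Math.Sqrt(b.Values.Sum(v => (double)v * v));
    return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
}

// Three of four words are shared, so the similarity is 3 / (2 * 2) = 0.75.
Console.WriteLine(CosineSimilarity(BagOfWords("I hate green apples"),
                                   BagOfWords("I hate red apples")));
```

You would compute the similarity between the spoken sentence and each known command and pick the best one, provided it is above some threshold you choose.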
On a more general note, there are many other considerations to be made of course.
For example, some words might be utterly irrelevant to the message you want to convey, as in the following sentence:
I want you to output a weather report.
Here, at least "I", "you", "to" and "a" could be done away with without damaging the basic semantics of the sentence. Such words are called stop words and are discarded early in many tools that perform speech-to-text analysis.
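Dropping stop words is a one-liner once you have the words. A small sketch, with a tiny made-up stop-word list (real lists are much longer) and a hypothetical ContentWords helper:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Tiny illustrative stop-word list; not a standard one.
var stopWords = new HashSet<string> { "i", "you", "to", "a", "the", "me", "can" };

// Lower-cases the sentence, splits it into words and drops the stop words.
string[] ContentWords(string sentence)
{
    return sentence.ToLowerInvariant()
        .Split(new[] { ' ', '.', ',', '?', '!' }, StringSplitOptions.RemoveEmptyEntries)
        .Where(w => !stopWords.Contains(w))
        .ToArray();
}

Console.WriteLine(string.Join(" ", ContentWords("I want you to output a weather report.")));
// prints: want output weather report
```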
Also note that we started out assuming that your program reliably identifies sound input. In reality, no tool is capable of infallibly identifying speech.
Humans tend to forget that sound actually exists without cues as to where word or sentence boundaries are. This makes so-called disambiguation of input a gargantuan task that is easily underestimated - and ambiguity one of the hardest problems of computational linguistics in general.

The code cannot work that out by itself. You need to split the command into an array of words, such as
Tomorrow
Weather
What
This way, you can compare each word with the vocabulary your program knows: say, the command (what), the type (weather) and the time (tomorrow).
It is better to read and understand each word than to hope it will just work like Google. Google does much the same: it breaks the string down and compares the pieces.

Related

Convert text to sentence case but with added constraints and grammar

I have a task and I am lost as to how to begin. I have to convert a sentence that is always in upper case to sentence case. That part is easy and I have done it, but there are some constraints.
For example, a common abbreviation such as VAT has to remain in capitals.
Something like City Tax has to remain as it is, and similarly for some other keywords.
Is there any AI approach I can take? Performance is important here, since this is for a backend API.
We have data to train on if needed, but it will require some manual labor. I would love a logical approach to this as well.
I code in C#.
I would love advice on how to approach this problem :)

Searching for partial substring within string in C#

Okay, so I'm trying to make a basic malware scanner in C#. Say I have the hex signature for a particular bit of code.
For example
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets Changed to -
{
System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
//Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind that these bits will sit somewhere within the entire hex of the program. How could I take my base signature and look for partial matches, so that anything with, say, a 90% match gets flagged?
I would use a wildcard, but that wouldn't work for slightly more complex cases where the code is written a little differently but is mostly the same. So is there a way to do a percentage match for a substring? I was looking into the Levenshtein distance, but I don't see how I'd apply it to this scenario.
Thanks in advance for any input
Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and use a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
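Here is one way this could look in C#; treat it as a sketch under my own assumptions, not a scanner design. Levenshtein is the textbook dynamic-programming version, while Similarity and ContainsApproximately are hypothetical helpers that turn the distance into a 0..1 score and slide a signature-sized window over the data.

```csharp
using System;

// Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
int Levenshtein(string a, string b)
{
    var prev = new int[b.Length + 1];
    var curr = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) prev[j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        }
        (prev, curr) = (curr, prev); // the row just filled becomes the previous row
    }
    return prev[b.Length];
}

// Similarity in [0, 1]; 1.0 means the strings are identical.
double Similarity(string a, string b)
{
    int longest = Math.Max(a.Length, b.Length);
    return longest == 0 ? 1.0 : 1.0 - (double)Levenshtein(a, b) / longest;
}

// Slides a signature-sized window over the data and reports whether
// any window reaches the requested similarity.
bool ContainsApproximately(string data, string signature, double minSimilarity)
{
    if (data.Length < signature.Length)
        return Similarity(data, signature) >= minSimilarity;

    for (int start = 0; start + signature.Length <= data.Length; start++)
        if (Similarity(data.Substring(start, signature.Length), signature) >= minSimilarity)
            return true;
    return false;
}

Console.WriteLine(Levenshtein("kitten", "sitting")); // prints 3
```

A fixed-size window is a simplification: an insertion such as test.txt becoming notatest.txt changes the length, so you may want to try a few window sizes. Each comparison is quadratic in the signature length, so this is slow on large files.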
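First, your program bits should match EXACTLY, or else the file has been modified or is corrupt. Generally, you store a hash (MD5, for example) of the original binary and compare it against the hash of the file you are checking. A hash only tells you whether the two are identical, not how similar they are, and because collisions are possible, even a matching MD5 is not an absolute guarantee that the files are the same.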
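A minimal sketch of that exact-match check. I have swapped in SHA-256 instead of MD5 here (my choice, since MD5 collisions are practical to construct); Sha256Hex is a made-up helper name, and in a real scanner the bytes would come from something like File.ReadAllBytes.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Returns the SHA-256 digest of the data as an upper-case hex string.
string Sha256Hex(byte[] data)
{
    using (var sha = SHA256.Create())
        return BitConverter.ToString(sha.ComputeHash(data)).Replace("-", "");
}

// Stand-ins for the original binary and a file under test.
byte[] original = Encoding.ASCII.GetBytes("System.IO.File.Delete");
byte[] modified = Encoding.ASCII.GetBytes("System.IO.File.Delete ");

Console.WriteLine(Sha256Hex(original) == Sha256Hex(original)); // prints True
Console.WriteLine(Sha256Hex(original) == Sha256Hex(modified)); // prints False
```

Note that a single changed byte gives a completely different hash, which is why this cannot be used for "90% similar" checks.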
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing. This means that each time the virus runs, it modifies itself, so the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).
I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many insert/delete/substitute operations are needed to turn one string into another; the latter additionally counts swapping two adjacent characters as a single operation.
I use these quite a lot when writing programs where users can search for things but may not know the exact spelling.
There are code examples in the articles on both.

Algorithm for Natural-Looking Sentence in English Language

I'm building an application that does sentence checking. Do you know of any DLLs that recognize sentences and their logic and organize them correctly? For example, one that rearranges the words of a sentence into a correct sentence.
If it's not available, maybe you can suggest search terms that I can research.
There are things called a language model and n-grams. I'll try to explain briefly what they are.
Suppose you have a huge collection of correct English sentences. Let's pick one of them:
The quick brown fox jumps over the lazy dog.
Let's now look at all the pairs of adjacent words (called bigrams) in it:
(the, quick), (quick, brown), (brown, fox), (fox, jumps) and so on...
With a huge collection of sentences we get a huge number of bigrams. We now take the unique ones and count their frequencies (the number of times we saw each one in correct sentences).
We now have, say
(the, quick) - 500
(quick, brown) - 53
Bigrams together with their frequencies are called a language model. It shows you how common a certain combination of words is.
So you can build all the possible sentences from your words and compute a weight for each of them using the language model. The sentence with the maximum weight is the one you need.
Where do you get bigrams and their frequencies? Well, Google has published n-gram data.
You are not limited to pairs of words; you can use triples and so on. That will allow you to build more human-like sentences.
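A toy version of this in C#, to show the moving parts. The corpus is three made-up sentences, the weight is simply the sum of bigram counts (a real model would use probabilities and smoothing), and all function names are mine. Trying every ordering only works for short inputs, since there are n! of them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Counts how often each adjacent word pair occurs in the corpus.
Dictionary<(string, string), int> CountBigrams(IEnumerable<string> sentences)
{
    var counts = new Dictionary<(string, string), int>();
    foreach (var sentence in sentences)
    {
        var words = sentence.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 1 < words.Length; i++)
        {
            var key = (words[i], words[i + 1]);
            counts[key] = counts.TryGetValue(key, out var n) ? n + 1 : 1;
        }
    }
    return counts;
}

// Weight of a word order: sum of the counts of its bigrams (unseen pairs count 0).
int Score(IList<string> words, Dictionary<(string, string), int> bigrams)
{
    int score = 0;
    for (int i = 0; i + 1 < words.Count; i++)
        if (bigrams.TryGetValue((words[i], words[i + 1]), out var n)) score += n;
    return score;
}

// Enumerates every ordering of the words.
IEnumerable<List<string>> Permutations(List<string> words)
{
    if (words.Count <= 1) { yield return new List<string>(words); yield break; }
    for (int i = 0; i < words.Count; i++)
    {
        var rest = new List<string>(words);
        rest.RemoveAt(i);
        foreach (var tail in Permutations(rest))
        {
            tail.Insert(0, words[i]);
            yield return tail;
        }
    }
}

// Picks the ordering with the highest weight.
List<string> BestOrder(List<string> words, Dictionary<(string, string), int> bigrams)
{
    return Permutations(words).OrderByDescending(p => Score(p, bigrams)).First();
}

var toyModel = CountBigrams(new[] { "the quick brown fox", "the quick dog", "the lazy dog" });
var shuffled = new List<string> { "fox", "quick", "the", "brown" };
Console.WriteLine(string.Join(" ", BestOrder(shuffled, toyModel))); // prints: the quick brown fox
```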
There are a few NLP (Natural Language Processing) libraries available, such as SharpNLP, and some in Java.
A few links:
http://nlpdotnet.com
http://blog.abodit.com/2010/02/a-strongly-typed-natural-language-engine-c-nlp/
http://sharpnlp.codeplex.com/
This is a very complex subject you are asking about. It's called computational linguistics or natural language processing, and it is the subject of ongoing research.
Here are a few links to get you started:
http://en.wikipedia.org/wiki/Natural_language_processing
http://en.wikipedia.org/wiki/Computational_linguistics
http://research.microsoft.com/en-us/groups/nlp/
I guess you won't be able to just download a DLL and let it flow :)

Creating a "spell check" that checks against a database with a reasonable runtime

I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is check a user's input against a certain column in a table for all these records and return any matches within a certain Hamming distance (again, this question is not about determining Hamming distance, etc.). The purpose, of course, is to create a "did you mean" feature: a user searches for a name, and if no direct matches are found in the database, a list of possible matches is returned.
I'm trying to come up with a way to do all of these checks in the most reasonable runtime possible. How can I check a user's input against all of these records in the most efficient way possible?
The feature is currently implemented, but the runtime is exceedingly slow. The way it works now is it loads all records from a user-specified table (or tables) into memory and then performs the check.
For what it's worth, I'm using NHibernate for data access.
I would appreciate any feedback on how I can do this or what my options are.
Calculating Levenshtein distance doesn't have to be as costly as you might think. The code in the Norvig article can be thought of as pseudocode to help the reader understand the algorithm. A much more efficient implementation (in my case, approx 300 times faster on a 20,000 term data set) is to walk a trie. The performance difference is mostly attributed to removing the need to allocate millions of strings in order to do dictionary lookups, spending much less time in the GC, and you also get better locality of reference, so you have fewer CPU cache misses. With this approach I am able to do lookups in around 2ms on my web server. An added bonus is the ability to easily return all results that start with the provided string.
The downside is that creating the trie is slow (can take a second or so), so if the source data changes regularly then you need to decide whether to rebuild the whole thing or apply deltas. At any rate, you want to reuse the structure as much as possible once it's built.
As Darcara said, a BK-Tree is a good first take. They are very easy to implement. There are several free implementations easily found via Google, but a better introduction to the algorithm can be found here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.
Unfortunately, calculating the Levenshtein distance is pretty costly, and you'll be doing it a lot if you're using a BK-Tree with a large dictionary. For better performance, you might consider Levenshtein Automata. A bit harder to implement, but also more efficient, and they can be used to solve your problem. The same awesome blogger has the details: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata. This paper might also be interesting: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652.
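For the BK-tree idea, a compact sketch might look like the following. It is not taken from the linked articles; the tree is stored as nested dictionaries (children[word][d] is the child of word at edit distance d) purely to keep the example short, and Add and Search are names I made up.

```csharp
using System;
using System.Collections.Generic;

// Classic dynamic-programming Levenshtein distance.
int Levenshtein(string a, string b)
{
    var prev = new int[b.Length + 1];
    var curr = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) prev[j] = j;
    for (int i = 1; i <= a.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
        }
        (prev, curr) = (curr, prev);
    }
    return prev[b.Length];
}

// children[word][d] is the child of `word` whose edit distance to it is d.
var children = new Dictionary<string, Dictionary<int, string>>();
string root = null;

// Inserts a word by following the edge labelled with its distance at each node.
void Add(string word)
{
    if (root == null) { root = word; children[word] = new Dictionary<int, string>(); return; }
    string node = root;
    while (true)
    {
        int d = Levenshtein(word, node);
        if (d == 0) return; // already in the tree
        if (!children[node].TryGetValue(d, out var next))
        {
            children[node][d] = word;
            children[word] = new Dictionary<int, string>();
            return;
        }
        node = next;
    }
}

// Returns all stored words within maxDistance of the query.
List<string> Search(string query, int maxDistance)
{
    var results = new List<string>();
    if (root == null) return results;
    var pending = new Stack<string>();
    pending.Push(root);
    while (pending.Count > 0)
    {
        string node = pending.Pop();
        int d = Levenshtein(query, node);
        if (d <= maxDistance) results.Add(node);
        // Triangle inequality: only edges labelled d - max .. d + max can lead to matches.
        foreach (var edge in children[node])
            if (edge.Key >= d - maxDistance && edge.Key <= d + maxDistance)
                pending.Push(edge.Value);
    }
    return results;
}

foreach (var w in new[] { "example", "sample", "temple", "apple", "banana" }) Add(w);
Console.WriteLine(string.Join(", ", Search("exaple", 1))); // prints: example
```

The pruning in Search is what saves work: subtrees whose edge label falls outside the window are never visited, so far fewer distance computations are needed than with a full scan.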
I guess the Levenshtein distance is more useful here than the Hamming distance.
Let's take an example: We take the word example and restrict ourselves to a Levenshtein distance of 1. Then we can enumerate all possible misspellings that exist:
1 insertion (208)
aexample
bexample
cexample
...
examplex
exampley
examplez
1 deletion (7)
xample
eample
exmple
...
exampl
1 substitution (182)
axample
bxample
cxample
...
examplz
You could store each misspelling in the database, and link that to the correct spelling, example. That works and would be quite fast, but creates a huge database.
Notice how most misspellings occur by doing the same operation with a different character:
1 insertion (8)
?example
e?xample
ex?ample
exa?mple
exam?ple
examp?le
exampl?e
example?
1 deletion (7)
xample
eample
exmple
exaple
examle
exampe
exampl
1 substitution (7)
?xample
e?ample
ex?mple
exa?ple
exam?le
examp?e
exampl?
That looks quite manageable. You could generate all these "hints" for each word and store them in the database. When the user enters a word, generate all "hints" from that and query the database.
Example: User enters exaple (notice missing m).
SELECT DISTINCT word
FROM dictionary
WHERE hint = '?exaple'
OR hint = 'e?xaple'
OR hint = 'ex?aple'
OR hint = 'exa?ple'
OR hint = 'exap?le'
OR hint = 'exapl?e'
OR hint = 'exaple?'
OR hint = 'xaple'
OR hint = 'eaple'
OR hint = 'exple'
OR hint = 'exale'
OR hint = 'exape'
OR hint = 'exapl'
OR hint = '?xaple'
OR hint = 'e?aple'
OR hint = 'ex?ple'
OR hint = 'exa?le'
OR hint = 'exap?e'
OR hint = 'exapl?'
exaple with 1 insertion == exa?ple == example with 1 substitution
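Generating the hints is straightforward. A minimal sketch (the Hints name is mine), which you would run once per dictionary word when filling the hint column and once per user input when building the query:

```csharp
using System;
using System.Collections.Generic;

// Builds the wildcard hints for a word: every single-character insertion
// ('?' added), deletion, and substitution ('?' replacing one character).
HashSet<string> Hints(string word)
{
    var hints = new HashSet<string>();
    for (int i = 0; i <= word.Length; i++)
        hints.Add(word.Substring(0, i) + "?" + word.Substring(i));      // insertion
    for (int i = 0; i < word.Length; i++)
    {
        hints.Add(word.Substring(0, i) + word.Substring(i + 1));        // deletion
        hints.Add(word.Substring(0, i) + "?" + word.Substring(i + 1));  // substitution
    }
    return hints;
}

Console.WriteLine(Hints("example").Count);                     // prints 22
Console.WriteLine(Hints("exaple").Overlaps(Hints("example"))); // prints True
```

In practice, build the WHERE clause from the hint set with a parameterized IN (...) list rather than string concatenation. The set also removes duplicates, which appear for words with repeated letters.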
See also: How does the Google “Did you mean?” Algorithm work?
it loads all records from a user-specified table (or tables) into memory and then performs the check
don't do that
Either
Do the match on the back end and only return the results you need.
or
Cache the records in memory early on, take the working-set hit, and do the check when you need it.
You will need to structure your data differently than a database can. Build a custom search tree, with all dictionary data needed, on the client. Although memory might become a problem if the dictionary is extremely big, the search itself will be very fast. O(nlogn) if I recall correctly.
Have a look at BK-Trees
Also, instead of using the Hamming distance, consider the Levenshtein distance
The answer you marked as correct..
Note: when I say dictionary in this post, I mean a hash map; basically, I mean a Python dictionary.
Another way to improve performance is to create an inverted index of the words.
Rather than calculating the edit distance against the whole database, you create 26 dictionaries, one per letter. English has 26 letters, so the keys are "a", "b", ... "z".
Assume you have the word "apple" in your database.
In the "a" dictionary, you add the word "apple".
In the "p" dictionary, you add the word "apple".
In the "l" dictionary, you add the word "apple".
In the "e" dictionary, you add the word "apple".
Do this for all the words in the database.
Now suppose a misspelled word is entered, let's say aplse.
You start with "a" and retrieve all the words in "a".
Then you take "p" and find the intersection of the words in "a" and "p".
Then you take "l" and find the intersection of the words in "a", "p" and "l".
You do this for all the letters of the input.
In the end you are left with just the words that contain all of the letters "a", "p", "l", "s", "e".
In the next step, you calculate the edit distance between the input word and the words returned by the steps above, which drastically reduces your run time.
There may be cases where nothing is returned.
Take something like "aklse": there is a good chance that no word contains all of these letters.
In this case, you have to back off from the step above until you have a non-empty set of words left.
So start with *klse (the intersection of the words for k, l, s, e): num(wordsreturned) = k1
then a*lse (the intersection of the words for a, l, s, e): numwords = k2
and so on.
Choose the variant that returns the most words. There is really no single right answer here, as many words may have the same edit distance. You can simply say that if the edit distance is greater than "k", there is no good match.
There are many sophisticated algorithms built on top of this.
For example, after these steps you can use statistical inference (the probability that the word is "apple" when the input is "aplse", and so on). Then you are going the machine learning way :)
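The letter index and the back-off step could be sketched like this in C#. All names are hypothetical, and the final edit-distance ranking of the candidates is left out. Note that with the "aplse" example the full intersection is already empty if no dictionary word contains an "s", so the back-off (dropping one letter at a time) is what finds "apple".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// letter -> set of dictionary words containing that letter
Dictionary<char, HashSet<string>> BuildIndex(IEnumerable<string> words)
{
    var index = new Dictionary<char, HashSet<string>>();
    foreach (var word in words)
        foreach (char c in word.Distinct())
        {
            if (!index.TryGetValue(c, out var bucket))
                index[c] = bucket = new HashSet<string>();
            bucket.Add(word);
        }
    return index;
}

// Intersects the word sets of all the given letters.
HashSet<string> WordsContainingAll(Dictionary<char, HashSet<string>> index, IEnumerable<char> letters)
{
    HashSet<string> result = null;
    foreach (char c in letters.Distinct())
    {
        if (!index.TryGetValue(c, out var bucket)) return new HashSet<string>();
        if (result == null) result = new HashSet<string>(bucket);
        else result.IntersectWith(bucket);
    }
    return result ?? new HashSet<string>();
}

// Tries all letters first; if nothing matches, drops one letter at a time
// and keeps the largest candidate set.
HashSet<string> Candidates(Dictionary<char, HashSet<string>> index, string input)
{
    var all = WordsContainingAll(index, input);
    if (all.Count > 0) return all;

    var best = new HashSet<string>();
    for (int i = 0; i < input.Length; i++)
    {
        var subset = WordsContainingAll(index, input.Remove(i, 1));
        if (subset.Count > best.Count) best = subset;
    }
    return best;
}

var letterIndex = BuildIndex(new[] { "apple", "maple", "grape", "lemon" });
Console.WriteLine(string.Join(", ", Candidates(letterIndex, "aplse").OrderBy(w => w)));
// prints: apple, maple
```

You would then compute the edit distance only against the returned candidates instead of the whole table.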

How do I determine if two similar band names represent the same band?

I'm currently working on a project that requires me to match our database of Bands and venues with a number of external services.
Basically I'm looking for some direction on the best method for determining if two names are the same. For Example:
Our database venue name - "The Pig and Whistle"
service 1 - "Pig and Whistle"
service 2 - "The Pig & Whistle"
etc etc
I think the main differences are going to be things like missing "the" or using "&" instead of "and" but there could also be things like slightly different spelling and words in different orders.
What algorithms/techniques are commonly used in this situation, do I need to filter noise words or do some sort of spell check type match?
Have you seen any examples of something similar in C#?
UPDATE: In case anyone is interested in a c# example there is a heap you can access by doing a google code search for Levenshtein distance
The canonical (and probably the easiest) way to do this is to measure the Levenshtein distance between the two strings. If the distance is small relative to the size of the string, it's probably the same string. Note that if you have to compare a lot of very small strings it'll be harder to tell whether they're the same or not. It works better with longer strings.
A smarter approach might be to compare the Levenshtein distance between the two strings but to assign a distance of zero to the more obvious transformations, like "and"/"&", "Snoop Doggy Dogg"/"Snoop", etc.
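One cheap way to get the "distance of zero for obvious transformations" effect is to normalize both names before comparing them. A sketch with an illustrative rule set of my own (lower-casing, "&" to "and", punctuation stripped, leading "the" dropped); note that it throws away non-ASCII letters, which you would not want for real band names.

```csharp
using System;
using System.Text.RegularExpressions;

// Canonical form used only for comparison, never for display.
string Normalize(string name)
{
    string s = name.ToLowerInvariant().Replace("&", " and ");
    s = Regex.Replace(s, @"[^a-z0-9 ]", " ");   // drop punctuation
    s = Regex.Replace(s, @"\s+", " ").Trim();   // collapse whitespace
    if (s.StartsWith("the ", StringComparison.Ordinal)) s = s.Substring(4);
    return s;
}

Console.WriteLine(Normalize("The Pig & Whistle")); // prints: pig and whistle
Console.WriteLine(Normalize("The Pig & Whistle") == Normalize("Pig and Whistle")); // prints True
```

Names that still differ after normalization can then go through the Levenshtein comparison, with the threshold applied to the normalized strings.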
I did something like this a while ago. I used the Discogs database (which is public domain), which also tracks artist aliases.
You can either:
Use an API call (namevariations field).
Download the monthly data dumps (*_artists.xml.gz) & import it in your database. This contains the same data, but is obviously a lot faster.
One advantage of this over the Levenshtein distance solution is that you'll get far fewer false matches.
For example, Ryan Adams and Bryan Adams have a score of 2, which is quite good (lower means a better match; Pig and Whistle and Pig & Whistle have a score of 3), yet they're obviously different people.
While you could make a smarter algorithm (which also looks at string length, for example), using the alias DB is a lot simpler and less error-prone; after implementing this, I could completely remove the solution that was suggested in the other answer and had better matches.
Soundex may also be useful.
In bioinformatics we use this to compare DNA- or protein sequences all the time.
There are plenty of algorithms, you probably want to look at global alignments.
In this respect the Needleman-Wunsch algorithm is probably what you seek.
If you have particularly long recurring strings to compare you might also want to consider heuristic searches like BLAST.
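For reference, the scoring part of Needleman-Wunsch fits in a few lines. This sketch only returns the global alignment score, not the alignment itself, and the scores (match +1, mismatch -1, gap -1) are illustrative defaults I picked, not something prescribed by the algorithm.

```csharp
using System;

// Global alignment score via dynamic programming.
int NeedlemanWunsch(string a, string b, int match = 1, int mismatch = -1, int gap = -1)
{
    var score = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) score[i, 0] = i * gap; // a aligned against gaps only
    for (int j = 0; j <= b.Length; j++) score[0, j] = j * gap; // b aligned against gaps only

    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int diag = score[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int up   = score[i - 1, j] + gap;   // gap in b
            int left = score[i, j - 1] + gap;   // gap in a
            score[i, j] = Math.Max(diag, Math.Max(up, left));
        }
    return score[a.Length, b.Length];
}

// 11 matching characters and 4 gaps for the missing "and ": 11 - 4 = 7.
Console.WriteLine(NeedlemanWunsch("pig and whistle", "pig whistle")); // prints 7
```

Unlike an edit distance, a higher score means a better match here, so any threshold has to be set relative to the string lengths.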
