I need to create a function that will tell me if a character is a vowel or a consonant, but I need it to be culture-independent. In other words, using a string with "aeiou" isn't good enough, because some languages use other vowels, such as those with accents. Do I have to compile a list of all Unicode characters that could be vowels, or is there an easier way to do this?
I don't think this is possible. Very few languages have a one-to-one match between characters and sounds to begin with. Take "iota": some will pronounce the first "i" as a vowel, others as a consonant.
The phonetic alphabet is supposed to help with this. See for instance:
http://en.wikipedia.org/wiki/International_Phonetic_Alphabet
You would have to use the phonetic alphabet as an intermediary and take the vowels from there. Then, however, you still have the problem of translating words into that phonetic alphabet. Some online dictionaries may be able to help you with that, but even then the same word will likely appear multiple times, sometimes with different pronunciations, and I don't know whether any of them let you hook up through a web service, or whether there are any offline options.
http://www.photransedit.com/online/text2phonetics.aspx (example with horrible full-screen ads)
This problem borders on the complexity of translation software, where you would really need some understanding of the context to understand which word you even need to look up and in what database.
So depending on your requirements, you may want to start as simple as you can, but take the above into account. To allow your application to gain precision later on, you could start by writing a function that returns the IPA vowels, and then build a lookup table that maps letters and letter combinations to them. Later on, you can look towards getting or creating better data.
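As a rough illustration, a minimal C# sketch of that lookup table might start like this (the spellings and IPA mappings below are just placeholder examples, not a complete or authoritative set):

using System;
using System.Collections.Generic;

static class VowelLookup
{
    // Placeholder starter table: spellings (letters or letter combinations)
    // mapped to the IPA vowel they typically represent in English.
    // Real data would come from charts like the one linked below.
    static readonly Dictionary<string, string> SpellingToIpaVowel =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "a",  "æ" },   // as in "cat"
            { "i",  "ɪ" },   // as in "sit"
            { "ee", "iː" },  // as in "see"
            { "oo", "uː" },  // as in "food"
        };

    // Returns true and the IPA vowel if the spelling is in the table.
    public static bool TryGetIpaVowel(string spelling, out string ipa) =>
        SpellingToIpaVowel.TryGetValue(spelling, out ipa);
}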
You can use charts like these as input:
http://www.antimoon.com/how/pronunc-soundsipa.htm
Many language training books also have an overview. I've always liked the 'Teach Yourself ... ' series, as they always have an overview of the sounds of a language.
A little personal project of mine is to blindly produce a search engine from scratch without using any outside sources. This is mostly for a learning experience and I haven't had much trouble up until now, where I have both a dilemma and a tough problem.
Observe this case:
Suzy wants to search for "fuzzy bears". This is fine; it functions as well as it can. However, Suzy screws up and types "fuzzybears". Right now, my search algorithm breaks down, since this is interpreted as a single token, not multiple tokens. Any case or combination of words that has even one occurrence of such a run-on term, or glued tokens, causes a poor search result.
For scope, this is something I am writing using a combination of C# and T-SQL.
I've tried multiple solutions, but nothing has really come from them. Firstly, I used a List to take the terms and create variations, but this was much too slow for my liking and required a lot more memory than I feel it should need.
I wanted to save search queries to a database for statistics and maybe to learn more about organically growing the algorithm, so maybe a way to handle these glued tokens in SQL could be a solution, but I have no clue how to start with something like that unless I used a cursor or some other slow solution.
I could take searches, save them to my database, create different combinations where some tokens are glued, and then have those glued tokens as terms to hit on? The issue with this solution is it takes up quite a bit of space and I won't always need these strings since spelling errors like this aren't all too common.
Mainly, what I need is speed. It doesn't really have to be pretty, but if it's fast and accurate then I'm happy even if it takes up a lot of disk space.
Not asking for solutions here, but if anyone can point me in a direction I can go, it would be greatly appreciated.
Consider this approach: since spaces, punctuation, and anything similar would screw up a search like this, remove all of those, convert to a common case (I prefer lowercase, but pick what you prefer), and then tokenize based on syllables, using roughly the same set of division rules as for hyphenating English words.
So, to search for answers that contain "Consider this approach:", you reduce the phrase to "considerthisapproach" and then tokenize as "con","sid","er","this","ap","proach". If con and sid and er appear next to each other, and in that order, you've found the word "consider".
This approach can be adapted for statistical matching too, so e.g. if at least 85% of syllables are found in the correct order, you consider it a close match, and maybe order the results by match % so more meaningful matches are at the top.
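Here is a minimal C# sketch of the normalize-then-score part of that idea. Note that the syllable division itself (the hyphenation rules) is assumed to be handled by a separate routine, which is the harder piece:

using System;
using System.Linq;

static class SyllableSearch
{
    // Strip spaces, punctuation, and anything similar; lowercase the rest.
    public static string Normalize(string text) =>
        new string(text.Where(char.IsLetterOrDigit).ToArray()).ToLowerInvariant();

    // Fraction of query syllables found, in order, in the candidate's syllables.
    public static double MatchRatio(string[] querySyllables, string[] candidateSyllables)
    {
        int found = 0, pos = 0;
        foreach (var syllable in querySyllables)
        {
            int next = Array.IndexOf(candidateSyllables, syllable, pos);
            if (next >= 0) { found++; pos = next + 1; }
        }
        return querySyllables.Length == 0 ? 0 : (double)found / querySyllables.Length;
    }
}

With that in place, MatchRatio(new[] { "con", "sid", "er" }, candidateSyllables) >= 0.85 would flag a close match as described above.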
I'm planning on making a casual word game for WP7 using XNA. The game mechanics are fine enough for me to implement but it is just the checking to see if the word they make is actually a word or not.
I thought about having a text file and loading that into memory at the start, but surely this wouldn't be possible to keep in memory for a phone? Also, how slow would it be to read from this to see if it is a word? How would the words be stored in memory? Would it be best to use a dictionary/hashmap where each key is a word and I just check to see if that key exists? Or would it be better to put them in an array?
Stuck on the best way to implement this, so any input is appreciated. Thanks
Depending on your phone's hardware, you could probably just load up a text file into memory. The English language probably has only a couple hundred thousand words. Assuming your average word is around 5 characters or so, that's roughly a meg of data. You will have overhead managing that file in memory, but that's where the specifics of the hardware matter. BTW, it's not uncommon for the current generation of phones to have a gig of RAM.
Please see the following related SO questions which require a text file for a dictionary of words.
Dictionary text file
Putting a text file into memory, even of a whole dictionary, shouldn't be too bad as seth flowers has said. Choosing an appropriate data structure to hold the words will be important.
I would not recommend a dictionary using words as keys... that's kind of silly, honestly. If you only have keys and no values, what good is a dictionary? However, you may be on a good track with the Dictionary idea. The first thing I would try would be a Dictionary<char, string[]>, where the key is the first letter and the value is a list of all words beginning with that letter. Of course, each array will be very long, and search time on the array will be slow (though lookup on the key should be zippy, as char hashes are unique). The advantage is that, if you use a proper .txt dictionary file and load each word in order, you will know that each list is ordered alphabetically. So you can use efficient search techniques like binary search, or any number of searches formulated for pre-sorted lists. It may not be that slow in the end.
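As a rough sketch of that (assuming the source word list is all lowercase and already sorted, so that an ordinal binary search is valid):

using System;
using System.Collections.Generic;
using System.Linq;

static class WordIndex
{
    // Build the first-letter index from an alphabetically sorted word list.
    public static Dictionary<char, string[]> Build(IEnumerable<string> sortedWords) =>
        sortedWords.GroupBy(w => w[0])
                   .ToDictionary(g => g.Key, g => g.ToArray());

    // Each per-letter array inherits the source ordering, so binary search works.
    public static bool IsWord(Dictionary<char, string[]> index, string word) =>
        !string.IsNullOrEmpty(word) &&
        index.TryGetValue(word[0], out var bucket) &&
        Array.BinarySearch(bucket, word, StringComparer.Ordinal) >= 0;
}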
If you want to go further, though, you can use the structure which underlies predictive text. It's called a Patricia Trie, or Radix Trie (Wikipedia). Starting with the first letter, you work your way through all possible branches until you either:
assemble the word the user entered, so it is a valid word
reach the end of the branch; this word does not exist.
'Tries' were made to address this sort of problem. I've never represented one in code, so I'm afraid I can't give you any pointers (ba dum tsh!), but there's likely a wealth of information on how to do it available on the internet. Using a Trie will likely be the most efficient solution, but if you find that an alphabet Dictionary like I mentioned above is sufficiently fast using binary search, you might just want to stick with that for now while you develop the actual gameplay. Getting bogged down with finding the best solution when just starting your game tends to bleed off your passion for getting it done. If you run into performance issues, then you make improvements-- at least that's my philosophy when designing games.
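For illustration, here is a minimal uncompressed trie sketch in C# (a true Patricia/radix trie additionally merges single-child chains into one node, but the lookup logic is the same):

using System.Collections.Generic;

class TrieNode
{
    readonly Dictionary<char, TrieNode> children = new Dictionary<char, TrieNode>();
    bool isWord;

    public void Add(string word)
    {
        var node = this;
        foreach (char c in word)
        {
            if (!node.children.TryGetValue(c, out var child))
                node.children[c] = child = new TrieNode();
            node = child;
        }
        node.isWord = true; // mark the end of a complete word
    }

    public bool Contains(string word)
    {
        var node = this;
        foreach (char c in word)
            if (!node.children.TryGetValue(c, out node))
                return false; // reached the end of a branch: this word does not exist
        return node.isWord;  // assembled the whole word: valid only if marked
    }
}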
The nice thing is, since Windows Phone essentially supports only 2 different specs, once you test the app and see it runs smoothly on them, you really don't have to worry about optimizing for any worse conditions. So use what works!
P.S.: on Windows Phone, loading text files is tricky. Here is a post on the issue which should help you.
I am creating an application in .NET.
I found a running application, http://www.spinnerchief.com/. It did what I needed it to do, but I did not get any help from Google.
I need functional results for my application, where users can give one sentence and then the user can get the same sentence, but have it worded differently.
Here is an example of what I want.
Suppose I put a sentence that is "Pankaj is a good man." The output should be similar to the following one:
Pankaj is a great person.
Pankaj is a superb man.
Pankaj is an acceptable guy.
Pankaj is a wonderful dude.
Pankaj is a superb male.
Pankaj is a good human.
Pankaj is a splendid gentleman.
To do this correctly for any arbitrary sentence you would need to perform natural language analysis of the source sentence. You may want to look into the SharpNLP library - it's a free library of natural language processing tools for C#/.NET.
If you're looking for a simpler approach, you have to be willing to sacrifice correctness to some degree. For instance, you could create a dictionary of trigger words, which - when they appear in a sentence - are replaced with synonyms from a thesaurus. The problem with this approach is making sure that you replace a word with an equivalent part of speech. In English, it's possible for certain words to be different parts of speech (verb, adjective, adverb, etc) based on their contextual usage in a sentence.
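A minimal C# sketch of that trigger-word approach, with a two-entry placeholder table standing in for a real thesaurus (and none of the part-of-speech checking just described):

using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TriggerWordSpinner
{
    // Placeholder trigger-word table: each entry maps a word to one
    // thesaurus alternative. A real table would offer several per word.
    static readonly Dictionary<string, string> Synonyms =
        new Dictionary<string, string>
        {
            { "good", "great" },
            { "man",  "person" },
        };

    // Replace every whole word that appears in the table.
    public static string Rewrite(string sentence) =>
        Regex.Replace(sentence, @"\b\w+\b",
            m => Synonyms.TryGetValue(m.Value, out var s) ? s : m.Value);
}

For example, Rewrite("Pankaj is a good man.") returns "Pankaj is a great person."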
An additional consideration you'll need to address (if you're not using an NLP library) is stemming. In most languages, certain parts of speech are conjugated/modified (verbs in English) based on the subject they apply to (or the object, speaker, or tense of the sentence).
If all you want to do is replace adjectives (as in your example), the approach of using trigger words may work - but it won't be readily extensible. Before you do anything, I would suggest that you clearly define the requirements and rules for your problem domain... and use that to decide which route to take.
For this, the best thing for you to use is WordNet and its hyponym/hypernym relations. There is a WordNet .Net library. For each word you want to alternate, you can either get its hypernym (i.e. for person, a hypernym means "person is a kind of...") or a hyponym ("X is a kind of person"). Then just replace the word you are alternating.
You will want to make sure you have the correct part-of-speech (i.e. noun, adjective, verb...) and there is also the issue of senses, which may introduce some undesired alternations (sense #1 is the most common).
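Because the WordNet .Net API differs between versions, the sketch below deliberately hides it behind a hypothetical interface; the interface and method names are made up for illustration, not taken from the library:

using System.Collections.Generic;

// Hypothetical wrapper around whatever WordNet library you end up using.
interface ISynonymSource
{
    // Returns hyponyms of the given noun, most common sense first.
    IReadOnlyList<string> GetHyponyms(string noun);
}

static class WordAlternator
{
    public static string Alternate(string word, ISynonymSource wordNet)
    {
        var candidates = wordNet.GetHyponyms(word);
        // Prefer the first (sense #1) alternate to limit odd substitutions.
        return candidates.Count > 0 ? candidates[0] : word;
    }
}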
I don't know anything about .Net, but you should look into using a dictionary function (I'm sure there is one, or at least a library that streamlines the process if there isn't).
Then you'd have to go through the string and omit words like "is" or "a", only taking the words you want to have synonyms for.
After this, it's pretty simple to have a loop spit out your sentences.
Good luck.
Question: In terms of program stability and ensuring that the system will actually operate, how safe is it to use chars like ¦, § or ‡ for complex delimiter sequences in strings? Can I reliably believe that I won't run into any issues with a program reading these incorrectly?
I am working in a system, using C# code, in which I have to store a fairly complex set of information within a single string. The readability of this string is only necessary on the computer side, end-users should only ever see the information after it has been parsed by the appropriate methods. Because some of the data in these strings will be collections of variable size, I use different delimiters to identify what parts of the string correspond to a certain tier of organization. There are enough cases that the standard sets of ;, |, and similar ilk have been exhausted. I considered two-char delimiters, like ;# or ;|, but I felt that it would be very inefficient. There probably isn't that large of a performance difference in storing with one char versus two chars, but when I have the option of picking the smaller option, it just feels wrong to pick the larger one.
So finally, I considered using the set of characters like the double dagger and section. They only take up one char, and they are definitely not going to show up in the actual text that I'll be storing, so they won't be confused for anything.
But character encoding is finicky. While the visibility to the end user is meaningless (since they, in fact, won't see it), I became recently concerned about how the programs in the system will read it. The string is stored in one database, while a separate program is responsible for both encoding and decoding the string into different object types for the rest of the application to work with. And if something that is expected to be written one way is possibly written another, then the whole system might fail, and I can't really let that happen. So is it safe to use these kinds of chars as background delimiters?
Since you must encode the data in a string, I assume it is because you are interfacing with other systems. Why not use something like XML or JSON for this rather than inventing your own data format?
With XML you can specify the encoding in use, e.g.:
<?xml version="1.0" encoding="UTF-8"?>
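For instance, with LINQ to XML in C# you get nesting for free, so delimiter characters inside the values stop mattering entirely (the element and attribute names here are made up for illustration):

using System.Xml.Linq;

var doc = new XElement("record",
    new XElement("items",
        new XElement("item", new XAttribute("id", 1), "first value"),
        new XElement("item", new XAttribute("id", 2), "second; value | with , delimiters")));

string stored = doc.ToString();            // safe to store in the database as-is
var roundTripped = XElement.Parse(stored); // and to parse back out later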
There is very little danger that any system that stores and retrieves Unicode text will alter those specific characters.
The main characters that can be altered in a text transfer process are the end-of-line markers. For example, FTPing a file from a Unix system to a Windows system in text mode might replace LINE FEED characters with CARRIAGE RETURN + LINE FEED pairs.
After that, some systems may perform a canonical normalization of the text. Combining characters and characters with diacritics on them should not be used unless canonical normalization (either composing or decomposing) is taken into account. The Unicode character database contains information about which transformations are required under these normalization schemes.
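For example, .NET exposes this through string.Normalize. The two spellings of "é" below are canonically equivalent but compare as different strings until they are normalized to the same form:

using System;
using System.Text;

string composed  = "\u00E9";   // é as a single precomposed code point (U+00E9)
string decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT (U+0301)

Console.WriteLine(composed == decomposed);                                    // False
Console.WriteLine(composed == decomposed.Normalize(NormalizationForm.FormC)); // True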
That sums up the biggest things to watch out for, and none of them are a problem for the characters that you have listed.
Other transformations that might be made, but are less likely, are case changes and compatibility normalizations. To avoid these, just stay away from alphabetic letters or anything that looks like an alphabetic letter. Some symbols are also converted in a compatibility normalization, so you should check the properties in the Unicode Character Database just to be sure. But it is unlikely that any system will do a compatibility normalization without expressly indicating that it will do so.
In the Unicode Code Charts, canonical normalizations are indicated by "≡" and compatibility normalizations are indicated by "≈".
You could take the same approach as URL or HTML encoding, and replace key chars with sequences of chars. I.e. & becomes &amp;.
Although this results in more chars, it could be pretty efficiently compressed due to the repetition of those sequences.
Well, Unicode is a standard, so as long as everybody involved (code, DB, etc.) is using Unicode, you shouldn't have any problems.
There are rarer characters in the Unicode set. As far as I know, only the chars below 0x20 (space) have special meanings; anything above that should be preserved in an NVARCHAR data column.
It is never going to be totally safe unless you have a good specification what characters can and cannot be part of your data.
Remember some of the laws of Murphy:
"Anything that can go wrong will."
"Anything that can't go wrong, will
anyway."
Those characters that definitely will not be used, may eventually be used. When they are, the application will definitely fail.
You can use any character you like as delimiter, if you only escape the values so that character is guaranteed not to appear in them. I wrote an example a while back, showing that you could even use a common character like "a" as delimiter.
Escaping the values of course means that some characters will be represented as two characters, but usually that will still be less of an overhead than using a multiple character delimiter. And more importantly, it's completely safe.
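A minimal C# sketch of that escape-then-join idea, using '|' as the delimiter and '\' as the escape character (both choices are arbitrary):

using System.Collections.Generic;
using System.Linq;
using System.Text;

static class DelimitedCodec
{
    // Escape the escape character first, then the delimiter, so neither
    // can appear unescaped inside a value.
    static string Escape(string value) =>
        value.Replace("\\", "\\\\").Replace("|", "\\|");

    public static string Join(IEnumerable<string> values) =>
        string.Join("|", values.Select(Escape));

    public static List<string> Split(string joined)
    {
        var values = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < joined.Length; i++)
        {
            char c = joined[i];
            if (c == '\\' && i + 1 < joined.Length) current.Append(joined[++i]); // unescape
            else if (c == '|') { values.Add(current.ToString()); current.Clear(); }
            else current.Append(c);
        }
        values.Add(current.ToString());
        return values;
    }
}

For example, Join(new[] { "a|b", "c" }) produces the string a\|b|c, and Split turns it back into the original two values.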
I'm writing a scanner as part of a compiler.
I'm having a major headache trying to write this one portion:
I need to be able to parse a stream of tokens and push them one by one into a vector, ignoring whitespace and tokenizing special symbols (for the simple case, let's just consider parentheses and braces).
Example:
int main(){ }
should parse into 6 different tokens:
int
main
(
)
{
}
How would you go about solving this? I'm writing this in C++, but a Java/C# solution would be appreciated as well.
Some points:
And no, I can't use Boost; I can't guarantee that the libraries will be available to me. (Don't ask...)
I don't want to use lex, or any other special tools. I've never done this before and just want to try this once to say I've done it.
Stroustrup's book, The C++ Programming Language, has a great example in it about building a lexer/parser for a simple calculator program. It should serve as a good starting point to learn how to do what you want.
Buy a copy of Compilers: Principles, Techniques, and Tools (the Dragon Book). What you're attempting to write is a lexer, not a "scanner".
Why write your own? Look at Lex.
If you must have your own, you just read the input character by character and maintain some minimum state to accumulate identifiers.
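Since a Java/C# solution was welcome, here is a minimal C# sketch of that minimum-state loop; it only handles identifiers, whitespace, and single-character symbols:

using System.Collections.Generic;
using System.Text;

static class Scanner
{
    public static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        foreach (char c in source)
        {
            if (char.IsLetterOrDigit(c) || c == '_')
            {
                current.Append(c); // still inside an identifier
            }
            else
            {
                if (current.Length > 0) { tokens.Add(current.ToString()); current.Clear(); }
                if (!char.IsWhiteSpace(c)) tokens.Add(c.ToString()); // (, ), {, } ...
            }
        }
        if (current.Length > 0) tokens.Add(current.ToString());
        return tokens;
    }
}

Tokenize("int main(){ }") yields exactly the six tokens from the example: int, main, (, ), { and }.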
The problem itself is not hard. If you can't solve it, you must be burned out, you just need a rest. Look at it again in the morning.
If you really want to learn something from this exercise, just start coding. It doesn't demand a lot of code, so you can fail repeatedly without blowing more than an afternoon.
At this point you'll have a good feel for the problem.
Then look in any random compilers book to see what the "usual" methods are, and you'll grok them immediately.
Umm... I'd just do a while loop with iterators, testing each character for type, and on an alpha to non-alpha change, dump the string if it's non-empty. If it's a non-alpha, non-whitespace character, I'd just push it onto the token stack; this is really a trivial parsing task. Shoot, I've been meaning to learn lex/yacc, but the level of parsing you want is really easy. I wrote an HTML tokenizer once which is more complicated than this... I mean, you are just looking for names, whitespace, and single non-alphanumeric characters. Just do it.
If you want to write this from scratch, you could look into writing a finite state machine (states in an enum, a big switch/case block for state switching). You'd have to push the state to a stack since everything can be nested.
I know that this is not the ideal method; I'm just trying to directly address the question.