I'm planning on making a casual word game for WP7 using XNA. The game mechanics are straightforward enough for me to implement; what I'm stuck on is checking whether the word the player makes is actually a word or not.
I thought about having a text file and loading it into memory at startup, but surely it wouldn't be possible to keep that in memory on a phone? And how slow would it be to read from it to check whether something is a word? How would the words be stored in memory? Would it be best to use a dictionary/hashmap where each key is a word, so I just check whether that key exists? Or should I put them in an array?
Stuck on the best way to implement this, so any input is appreciated. Thanks
Depending on your phone's hardware, you could probably just load a text file into memory. The English language has only a couple hundred thousand words. Assuming your average word is around 5 characters or so, that's roughly a meg of data. You will have some overhead managing that file in memory, but that's where the specifics of the hardware matter. By the way, it's not uncommon for the current generation of phones to have a gig of RAM.
Please see the following related SO questions which require a text file for a dictionary of words.
Dictionary text file
Putting a text file into memory, even one holding a whole dictionary, shouldn't be too bad, as seth flowers has said. Choosing an appropriate data structure to hold the words will be important.
I would not recommend a dictionary using words as keys... that's kind of silly, honestly. If you only have keys and no values, what good is a dictionary? However, you may be on the right track with the Dictionary idea. The first thing I would try is a Dictionary<char, string[]>, where the key is the first letter and the value is an array of all words beginning with that letter. Of course, each array will be very long, and a linear search over it slow (though lookup on the key should be zippy, as char hashes are unique). The advantage is that, if you use a proper .txt dictionary file and load each word in order, you will know each array is alphabetically ordered. So you can use efficient techniques like binary search, or any number of searches formulated for pre-sorted lists. It may not be that slow in the end.
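A minimal sketch of that idea (untested; WordIndex and allWords are illustrative names, and the word list must already be lowercase and sorted in ordinal order, which a plain sorted .txt word list is):

using System;
using System.Collections.Generic;
using System.Linq;

class WordIndex
{
    private readonly Dictionary<char, string[]> buckets;

    public WordIndex(IEnumerable<string> allWords)
    {
        // GroupBy preserves the incoming order, so each bucket stays sorted.
        buckets = allWords.GroupBy(w => w[0])
                          .ToDictionary(g => g.Key, g => g.ToArray());
    }

    public bool IsWord(string candidate)
    {
        string w = candidate.ToLowerInvariant();
        string[] bucket;
        if (!buckets.TryGetValue(w[0], out bucket)) return false;
        // Safe because each bucket is already ordinally sorted.
        return Array.BinarySearch(bucket, w, StringComparer.Ordinal) >= 0;
    }
}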
If you want to go further, though, you can use the structure which underlies predictive text. It's called a Patricia trie, or radix trie (Wikipedia). Starting with the first letter, you work your way through all possible branches until you either:
assemble the word the user entered, in which case it is a valid word, or
reach the end of a branch, in which case the word does not exist.
'Tries' were made to address exactly this sort of problem. I've never represented one in code myself, so I'm afraid I can't give you any pointers (ba dum tsh!), but there's a wealth of information on how to do it properly available on the internet; the sketch below just illustrates the shape. Using a trie will likely be the most efficient solution, but if you find that an alphabet Dictionary like the one above is sufficiently fast with binary search, you might just want to stick with that for now while you develop the actual gameplay. Getting bogged down in finding the best solution when just starting your game tends to bleed off your passion for getting it done. If you run into performance issues, then you make improvements; at least that's my philosophy when designing games.
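A bare-bones, uncompressed trie sketch (untested; a real Patricia/radix trie would also collapse chains of single-child nodes):

using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public bool IsWordEnd;
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Add(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            TrieNode next;
            if (!node.Children.TryGetValue(c, out next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.IsWordEnd = true;   // this path spells a complete word
    }

    public bool Contains(string word)
    {
        TrieNode node = root;
        foreach (char c in word)
            if (!node.Children.TryGetValue(c, out node))
                return false;    // fell off the end of a branch: not a word
        return node.IsWordEnd;
    }
}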
The nice thing is that, since Windows Phone supports essentially only two different hardware specs, once you test the app and see it runs smoothly on both, you really don't have to worry about optimizing for any worse conditions. So use what works!
P.S.: on Windows Phone, loading text files is tricky. Here is a post on the issue which should help you.
I have a file that contains a lot of people's names and some metadata for each person.
I'm trying to do a search for the people, given some search query.
So I first do an exact search (easy: implemented as a Dictionary with a key lookup).
If that fails (or doesn't reach the maximum number of records specified), I then go back over all the keys and do a StartsWith check.
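For reference, a trimmed-down version of what I'm doing now (Person and the surrounding names are simplified stand-ins):

using System;
using System.Collections.Generic;

class Person { /* name plus the per-person metadata */ }

class PersonSearch
{
    // Key is the person's name; Person holds the metadata.
    private readonly Dictionary<string, Person> people;

    public PersonSearch(Dictionary<string, Person> people) { this.people = people; }

    public List<Person> Search(string query, int maxResults)
    {
        var results = new List<Person>();

        // 1. Exact search: a straight key lookup.
        Person exact;
        if (people.TryGetValue(query, out exact))
            results.Add(exact);

        // 2. Fallback: linear StartsWith scan over every key.
        if (results.Count < maxResults)
            foreach (var pair in people)
            {
                if (results.Count >= maxResults) break;
                if (pair.Key.StartsWith(query, StringComparison.Ordinal)
                    && !results.Contains(pair.Value))
                    results.Add(pair.Value);
            }

        return results;
    }
}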
Now, this works fine. But I was curious whether there's a better way to do this. Looking around the interwebs, it looks like there's a data structure called a Patricia trie.
so ..
Is a Patricia trie the recommended way of doing a StartsWith check over a large set of string keys?
If yes, is there a NuGet package for this?
No - I do not want to do a Contains check. Nope. Nada. Zilch.
Edit 1:
Also, my data will be loaded once at app startup, so I'm happy to incur a slower 'build' step while creating/populating the data structure, especially if runtime search performance is better for it.
Also, I understand that 'performance' depends on heaps of factors, like the number of items stored and the size of the search queries. So this is more an academic question for which I'd like to do some code comparison, rather than a micro-optimisation anti-pattern.
Edit 2:
Also, this previous SO question sort of covers a similar problem, but the answers seem to be talking about a substring search, not a prefix (i.e. StartsWith) search.
A little personal project of mine is to blindly build a search engine from scratch, without using any outside sources. This is mostly for the learning experience, and I hadn't had much trouble up until now, where I've hit both a dilemma and a tough problem.
Observe this case:
Suzy wants to search for "fuzzy bears". This is fine and functions as well as it can. However, Suzy slips up and types "fuzzybears". Right now my search algorithm breaks down, since this is interpreted as a single token rather than multiple tokens. Any case or combination of words that has even one occurrence of such a run-on term, or glued tokens, produces a poor search result.
For scope, this is something I am writing using a combination of C# and T-SQL.
I've tried multiple solutions, but nothing has really come of them. First, I used a List to take the terms and create variations, but this was much too slow for my liking and required far more memory than I feel it should need.
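The gist of the variation idea I tried looked something like this (heavily simplified; the real version generated far more combinations per query, which is where the time and memory went):

using System.Collections.Generic;

// Try every split point of a glued token; keep splits where both halves
// are known terms. "fuzzybears" -> { "fuzzy", "bears" }.
static List<string[]> SplitVariations(string token, HashSet<string> knownTerms)
{
    var results = new List<string[]>();
    for (int i = 1; i < token.Length; i++)
    {
        string left = token.Substring(0, i);
        string right = token.Substring(i);
        if (knownTerms.Contains(left) && knownTerms.Contains(right))
            results.Add(new[] { left, right });
    }
    return results;
}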
I want to save search queries to a database for statistics, and maybe to learn more about organically growing the algorithm, so perhaps there's a way to handle these glued tokens in SQL. But I have no clue how I'd start with something like that without resorting to a cursor or some other slow solution.
I could take searches, save them to my database, create different combinations where some tokens are glued, and then use those glued tokens as terms to hit on? The issue with this solution is that it takes up quite a bit of space, and I won't always need these strings, since spelling errors like this aren't all that common.
Mainly, what I need is speed. It doesn't really have to be pretty, but if it's fast and accurate then I'm happy even if it takes up a lot of disk space.
I'm not asking for a full solution here, but if anyone can point me in a direction I can go, it would be greatly appreciated.
Consider this approach: since spaces, punctuation, and anything similar would screw up a search like this, remove all of those, convert to a common case (I prefer lowercase, but pick what you prefer), and then tokenize based on syllables, using roughly the same set of division rules as for hyphenating English words.
So, to search for answers that contain "Consider this approach:", you reduce the phrase to "considerthisapproach" and then tokenize as "con","sid","er","this","ap","proach". If con and sid and er appear next to each other, and in that order, you've found the word "consider".
This approach can be adapted for statistical matching too, so e.g. if at least 85% of the syllables are found in the correct order, you consider it a close match, and perhaps order the results by match percentage so the more meaningful matches are at the top.
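A very rough sketch of the idea (the syllable splitter here is a naive vowel-group placeholder; a real implementation would use proper English hyphenation rules):

using System;
using System.Collections.Generic;
using System.Linq;

static class SyllableMatch
{
    // Strip everything but letters and lowercase the rest.
    public static string Normalize(string s)
    {
        return new string(s.ToLowerInvariant().Where(char.IsLetter).ToArray());
    }

    // Naive placeholder: end a chunk after each vowel group.
    public static List<string> Syllables(string s)
    {
        var parts = new List<string>();
        int start = 0;
        for (int i = 0; i < s.Length; i++)
        {
            bool vowel = "aeiouy".IndexOf(s[i]) >= 0;
            bool nextIsVowel = i + 1 < s.Length && "aeiouy".IndexOf(s[i + 1]) >= 0;
            if (vowel && !nextIsVowel)
            {
                parts.Add(s.Substring(start, i - start + 1));
                start = i + 1;
            }
        }
        if (start < s.Length) parts.Add(s.Substring(start));
        return parts;
    }

    // Fraction of query syllables found, in order, in the candidate text.
    public static double Score(List<string> query, List<string> candidate)
    {
        int matched = 0;
        foreach (string syl in candidate)
            if (matched < query.Count && syl == query[matched]) matched++;
        return query.Count == 0 ? 0.0 : (double)matched / query.Count;
    }
}

A Score of 0.85 or higher would then count as a close match per the rule above.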
I'm extremely familiar with regex, so before you all start answering with variations of \d+:
I want to know if there are alternatives to regex for parsing numbers out of a large text file.
I'm parsing through tons of huge files and need to do some group/location analysis on the positions of keywords. I'm now at the point where I need to start finding groups of numbers nested close to my content of interest as well. I want to avoid regex if at all possible, because this needs to be a speedy process.
It is possible to take chunks of a file to inspect for the numbers of interest. That, however, would require more work and would add hard-coded limits on the searching. (I'd like to avoid this.)
I'm open to any suggestions.
UPDATE
Sorry for the lack of sample data. For HIPAA reasons I'd rather not even consider scrambling the text and posting it.
A great substitute would be the HTML source of any stackoverflow.com question page. Imagine I needed to grab the reputation (score) of all people that posted an answer to a question. This also means that the comma (,) is needed as well. I can't remove the html to simplify the content because I'm using some density analysis to weed out unrelated content. Removing the HTML would mix content too close together.
Unless the file is some sort of SGML, I don't know of any existing method (which is not to say there isn't one; I just don't know of one).
However, that's not to say you can't create your own parser; you could eliminate some of the overhead of the .NET regex library by writing something that only finds runs of digits.
Fundamentally, I guess that's all any library would do at the most basic level.
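As a sketch of what that hand-rolled scanner might look like (treating a comma between digits as part of the number, per the reputation example, and recording offsets since you need positions):

using System.Collections.Generic;

// One pass over the text; records the start offset and text of each digit
// run. Commas are kept only when flanked by digits, so "1,234" survives.
static List<KeyValuePair<int, string>> FindNumbers(string text)
{
    var found = new List<KeyValuePair<int, string>>();
    int i = 0;
    while (i < text.Length)
    {
        if (char.IsDigit(text[i]))
        {
            int start = i;
            while (i < text.Length &&
                   (char.IsDigit(text[i]) ||
                    (text[i] == ',' && i + 1 < text.Length && char.IsDigit(text[i + 1]))))
                i++;
            found.Add(new KeyValuePair<int, string>(start, text.Substring(start, i - start)));
        }
        else
        {
            i++;
        }
    }
    return found;
}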
It might help if you could post a sample of the sort of data you'll be processing.
In a future project I will need to implement search functionality for words (either by length, or given a set of characters and their positions in the word) which returns all words that meet the given criteria.
In order to do so, I will need language dictionaries that can be easily queried with LINQ. The first thing I'd like to ask is whether anyone knows of good dictionaries to use in this kind of application and environment.
I'd also like to ask about good ways to search said dictionary for a word. Would a hash table help speed up queries? The thing is that a language dictionary can be quite huge, and knowing I will have plenty of search criteria, what would be a good way to implement this functionality without hindering search speed?
Without knowing the exact set of things you are likely to need to optimize for, it's hard to say. The standard data structure for efficiently organizing a large corpus of words for fast retrieval is the "trie" or, if space efficiency is important (say, because you're writing a program for a phone or some other memory-constrained environment), a DAWG, a Directed Acyclic Word Graph. (A DAWG is essentially a trie that merges common paths to leaves.)
Other interesting questions that I'd want to know an answer to before designing a data structure are things like: will the dictionary ever change? If it does change, are there performance constraints on how fast the new data needs to be integrated into the structure? Will the structure be used only as a fast lookup device, or would you like to store summary information about the words in it as well? (If the latter then a DAWG is unsuitable, since two words may share the same prefix and suffix nodes.) And so on.
I would search the literature for information on tries, DAWGs and ways to optimize Scrabble programs; clearly Scrabble requires all kinds of clever searching of a corpus of strings, and as a result there have been some very fast variants on DAWG data structures built by Scrabble enthusiasts.
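For comparison, the brute-force LINQ version of the kind of query you describe is trivial, and may be a perfectly adequate baseline to measure any trie or DAWG against (a sketch; "words" is whatever word list you load):

using System.Collections.Generic;
using System.Linq;

static class WordQueries
{
    // Brute force: all six-letter words with 'a' at index 2 and 'e' at
    // index 5 (zero-based). Scans the whole list, but the predicate is cheap.
    public static IEnumerable<string> Example(IEnumerable<string> words)
    {
        return words.Where(w => w.Length == 6 && w[2] == 'a' && w[5] == 'e');
    }
}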
I have recently written an immutable trie data structure in C# which I'm planning on blogging about at some point. I'll update this answer in the coming months if I do end up doing that.
In our desktop application, we have implemented a simple search engine using an inverted index.
Unfortunately, some of our users' datasets can get very large, e.g. taking up ~1GB of memory before the inverted index has been created. The inverted index itself takes up a lot of memory, almost as much as the data being indexed (another 1GB of RAM).
Obviously this creates problems with out-of-memory errors, as the 32-bit Windows limit of 2GB of memory per application is hit, or as users with lower-spec computers struggle to cope with the memory demand.
Our inverted index is stored as a:
Dictionary<string, List<ApplicationObject>>
This is created during the data load, as each object is processed, such that the ApplicationObject's key string and description words are stored in the inverted index.
So, my question is: is it possible to store the search index more efficiently space-wise? Perhaps a different structure or strategy needs to be used? Alternatively, is it possible to create a kind of CompressedDictionary? Since it stores lots of strings, I would expect it to be highly compressible.
If it's going to be 1GB... put it on disk. Use something like Berkeley DB. It will still be very fast.
Here is a project that provides a .net interface to it:
http://sourceforge.net/projects/libdb-dotnet
I see a few solutions:
If you have the ApplicationObjects in an array, store just the index into that array; it might be smaller.
You could use a bit of C++/CLI to store the dictionary, using UTF-8.
Don't bother storing all the different strings; use a Trie.
I suspect you may find you've got a lot of very small lists.
I suggest you find out roughly what the frequency is like - how many of your dictionary entries have single element lists, how many have two element lists etc. You could potentially store several separate dictionaries - one for "I've only got one element" (direct mapping) then "I've got two elements" (map to a Pair struct with the two references in) etc until it becomes silly - quite possibly at about 3 entries - at which point you go back to normal lists. Encapsulate the whole lot behind a simple interface (add entry / retrieve entries). That way you'll have a lot less wasted space (mostly empty buffers, counts etc).
If none of this makes much sense, let me know and I'll try to come up with some code.
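In the meantime, here's a rough sketch of the tiered idea (untested, and assuming the ApplicationObject type from the question):

using System.Collections.Generic;
using System.Linq;

class CompactIndex
{
    struct Pair { public ApplicationObject First, Second; }

    // Tier 1: keys with exactly one entry (no List overhead at all).
    Dictionary<string, ApplicationObject> singles = new Dictionary<string, ApplicationObject>();
    // Tier 2: keys with exactly two entries, stored in a struct.
    Dictionary<string, Pair> pairs = new Dictionary<string, Pair>();
    // Tier 3: everything bigger falls back to a normal list.
    Dictionary<string, List<ApplicationObject>> overflow = new Dictionary<string, List<ApplicationObject>>();

    public void Add(string key, ApplicationObject value)
    {
        ApplicationObject one;
        if (singles.TryGetValue(key, out one))
        {
            singles.Remove(key);
            pairs[key] = new Pair { First = one, Second = value };
            return;
        }
        Pair two;
        if (pairs.TryGetValue(key, out two))
        {
            pairs.Remove(key);
            overflow[key] = new List<ApplicationObject> { two.First, two.Second, value };
            return;
        }
        List<ApplicationObject> list;
        if (overflow.TryGetValue(key, out list)) list.Add(value);
        else singles[key] = value;
    }

    public IEnumerable<ApplicationObject> Get(string key)
    {
        ApplicationObject one;
        if (singles.TryGetValue(key, out one)) return new[] { one };
        Pair two;
        if (pairs.TryGetValue(key, out two)) return new[] { two.First, two.Second };
        List<ApplicationObject> list;
        return overflow.TryGetValue(key, out list)
            ? list
            : Enumerable.Empty<ApplicationObject>();
    }
}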
I agree with bobwienholt, but if you are indexing datasets, I assume they came from a database somewhere. Would it make sense to just search that with a search engine like DTSearch or Lucene.net?
You could take the approach Lucene did. First, you create a random-access in-memory stream (System.IO.MemoryStream); this stream mirrors an on-disk one, but only holds a portion of it (if you have the wrong portion, load another one off the disk). This does cause one headache: you need a file-mappable format for your dictionary. Wikipedia has a description of the paging technique.
On the file-mappable scenario: if you open up Reflector and inspect the Dictionary class, you will see that it is composed of buckets. You can probably use each of these buckets as a page and a physical file (that way inserts are faster). You can then also loosely delete values by simply appending an "item x deleted" marker to the file, and clean the file up every so often.
By the way, buckets hold values with identical hashes. It is very important that the values you store override the GetHashCode() method (and the compiler will warn you about Equals(), so override that as well). You will get a significant speed increase in lookups if you do this.
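For illustration, the overrides might look like this (a sketch only; the Key and Description fields are assumptions about what ApplicationObject holds):

using System;

class ApplicationObject
{
    public string Key;
    public string Description;

    public override int GetHashCode()
    {
        // Hash on the identifying field, not the default reference identity.
        return Key == null ? 0 : Key.GetHashCode();
    }

    public override bool Equals(object obj)
    {
        var other = obj as ApplicationObject;
        return other != null && Key == other.Key;
    }
}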
How about using the Win32 memory-mapped file API to transparently back your memory structure?
http://www.eggheadcafe.com/articles/20050116.asp has the PInvokes necessary to enable it.
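For what it's worth, on .NET 4 and later there is a managed wrapper that avoids the PInvoke layer entirely; a minimal sketch (file name, map name, and capacity are illustrative):

using System.IO;
using System.IO.MemoryMappedFiles;

class MappedIndex
{
    public static void Demo()
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   "index.bin", FileMode.OpenOrCreate, "search-index", 256L * 1024 * 1024))
        using (var view = mmf.CreateViewAccessor())
        {
            view.Write(0, 12345L);             // write a long at offset 0
            long readBack = view.ReadInt64(0); // read it back
        }
    }
}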
Is the index only added to or do you remove keys from it as well?