A little personal project of mine is to blindly produce a search engine from scratch without using any outside sources. This is mostly a learning experience, and I haven't had much trouble up until now, where I have both a dilemma and a tough problem.
Observe this case:
Suzy wants to search for "fuzzy bears". This is fine and works as well as it can. However, Suzy slips and types "fuzzybears". Right now my search algorithm breaks down, since this is interpreted as a single token rather than two. Any query that contains even one such run-on term (glued tokens) produces poor search results.
For scope, this is something I am writing using a combination of C# and T-SQL.
I've tried multiple solutions, but nothing has really come from them. Firstly, I used a List to take the terms and create variations, but this was much too slow for my liking and required a lot more memory than I feel it should need.
I wanted to save search queries to a database for statistics and maybe to learn more about organically growing the algorithm, so maybe a way to handle these glued tokens in SQL could be a solution, but I have no clue how to start with something like that unless I used a cursor or some other slow solution.
I could take searches, save them to my database, create different combinations where some tokens are glued, and then have those glued tokens as terms to hit on? The issue with this solution is it takes up quite a bit of space and I won't always need these strings since spelling errors like this aren't all too common.
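The "store glued combinations as terms to hit on" idea can be sketched roughly as below. This is a minimal Python sketch, not the asker's implementation: `glued_variants` is a hypothetical helper name, and it only glues adjacent pairs of tokens, which is an assumption about how far the variations need to go.

```python
def glued_variants(query):
    """Map each glued pair of adjacent tokens back to the tokens it came from.

    Only adjacent pairs are glued (an assumption); longer runs would need
    the same idea applied to wider windows.
    """
    tokens = query.lower().split()
    return {a + b: (a, b) for a, b in zip(tokens, tokens[1:])}


# A table built from saved queries could then be consulted at search time.
variants = glued_variants("fuzzy bears")
```

Looking up an incoming token such as "fuzzybears" in that table yields the original pair, so the search can be rerun with the separated terms.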
Mainly, what I need is speed. It doesn't really have to be pretty, but if it's fast and accurate then I'm happy even if it takes up a lot of disk space.
I'm not asking for complete solutions here, but if anyone can point me in a direction I can go, it would be greatly appreciated.
Consider this approach: since spaces, punctuation, and anything similar would screw up a search like this, remove all of those, convert to a common case (I prefer lowercase, but pick what you prefer), and then tokenize based on syllables, using roughly the same set of division rules as for hyphenating English words.
So, to search for answers that contain "Consider this approach:", you reduce the phrase to "considerthisapproach" and then tokenize as "con","sid","er","this","ap","proach". If con and sid and er appear next to each other, and in that order, you've found the word "consider".
This approach can be adapted for statistical matching too, so e.g. if at least 85% of syllables are found in the correct order, you consider it a close match, and maybe order the results by match % so more meaningful matches are at the top.
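A rough Python sketch of the normalization and the percentage match described above. The syllable splitting itself is not shown: both functions assume the syllables have already been produced by some hyphenation routine, and the names `normalize` and `ordered_match_ratio` are made up for illustration. "Found in the correct order" is interpreted here as the longest common subsequence of the two syllable lists, which is one possible reading.

```python
import re


def normalize(text):
    """Lowercase the text and drop everything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def ordered_match_ratio(query_syllables, doc_syllables):
    """Fraction of query syllables that occur in the document in the same order.

    Computed as the longest common subsequence length divided by the
    number of query syllables.
    """
    if not query_syllables:
        return 0.0
    prev = [0] * (len(doc_syllables) + 1)
    for q in query_syllables:
        cur = [0]
        for j, d in enumerate(doc_syllables, 1):
            if q == d:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(query_syllables)
```

A result could then be treated as a close match when the ratio is at least 0.85, and results sorted by the ratio in descending order.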
I have a file that contains a lot of names of people and some meta for each person.
I'm trying to do a search for the people, given some search query.
So I'm first going to do an exact search (easy, implemented as a Dictionary with a Key Lookup).
If that fails (or fails to reach the max number of records specified), then I was hoping to try and go back over all the keys again and do a StartsWith check.
Now this works fine. But I was curious if there's a better way to do this. Looking around the interwebs it looks like there's a data structure called a Patricia Trie alg.
so ..
Is the Patricia Trie the recommended way of doing a StartsWith check over a large set of string-keys?
If yes, is there a nuget package for this?
No - I do not want to do a Contains check. nope. nada. zilch.
Edit 1:
Also, my data will be loaded in once at app startup so I'm happy to incur a slower 'build' step while creating/populating the data structure. Especially if runtime search-perf is better.
Also, I understand that 'performance' depends on heaps of factors, like the number of items stored and the size of the search queries. So this is more of an academic question where I'd like to do some code comparison, rather than a micro-optimisation anti-pattern.
Edit 2:
Also, this previous SO question sorta talks about a similar problem I'm having but the answers seem to be talking about a substring search and not a prefix (i.e. StartsWith) search.
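For comparison with a trie, the exact-then-prefix lookup from the question can also be done with a sorted key list and a binary search. This is a swapped-in technique, not a Patricia trie, and the sketch is in Python rather than C#; `search`, `people` and `max_results` are illustrative names. The one-off sort fits the "slow build at startup is fine" constraint.

```python
from bisect import bisect_left


def search(people, sorted_keys, query, max_results=10):
    """Return the exact match first (if any), then keys that start with query.

    people      -- dict mapping name -> metadata (used for the exact lookup)
    sorted_keys -- sorted(people), built once at startup
    """
    results = []
    if query in people:
        results.append(query)
    # All keys sharing a prefix are contiguous in a sorted list, so jump to
    # the first candidate and walk forward while the prefix still matches.
    i = bisect_left(sorted_keys, query)
    while (i < len(sorted_keys)
           and len(results) < max_results
           and sorted_keys[i].startswith(query)):
        if sorted_keys[i] != query:
            results.append(sorted_keys[i])
        i += 1
    return results
```

The prefix stage costs one binary search plus one step per returned key, instead of a StartsWith check over every key.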
I am currently working on an assignment in which I am to validate various formats using regular expressions (phone numbers, birth date, email address, Social Security). One of the features our teacher has suggested would be to have a method that returns the state an individual was born using their Social Security Number.
xxx-xx-xxxx
The first 3 digits correspond to a state/area as outlined here:
http://socialsecuritynumerology.com/prefixes.php
If I've already isolated the first 3 digits as an integer, is there any way I could quickly match the number with its corresponding area?
Currently I'm only using if-else statements, but it's getting pretty tedious.
Example:
if (x > 0 && x <= 3)
    return "New Hampshire";
else if (x <= 7)
    return "Maine";
...
You have a few options here:
50 if statements, one for each state, as you are doing.
A switch with 999 conditions, matching each option with a state. It probably looks cleaner and you can generate it with a script and interject the return statements wherever necessary. Maybe worse than option 1 in terms of tediousness.
Import the file as text, parse it into a Dictionary and do a simple lookup. The mapping is most likely not going to change in the near future, so the robustness argument is rather moot, but it is probably "simpler" in terms of amount of effort*. And it's another chance to practice regex to parse lines in the file you linked.
*Where "effort" is measured purely in the amount of tedious gruntwork prone to annoying human error and hand fatigue. Energy consumed within the brain due to engineering and implementing a solution where the computer does the ugly stuff for you is not included. :)
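The third option might look something like the Python sketch below. The line format ("001-003 New Hampshire") is an assumption made for illustration; the actual layout of the linked page is not shown in the question, so the regex would need adjusting to whatever the file really contains. `parse_area_table` is a hypothetical name.

```python
import re

# Assumed line format: "<first>-<last> <state name>", e.g. "001-003 New Hampshire".
LINE = re.compile(r"^\s*(\d{3})\s*-\s*(\d{3})\s+(.+?)\s*$")


def parse_area_table(text):
    """Expand each range into one dictionary entry per area number."""
    lookup = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip headers, blank lines, anything unrecognized
        first, last, state = int(m.group(1)), int(m.group(2)), m.group(3)
        for number in range(first, last + 1):
            lookup[number] = state
    return lookup
```

After the table is built once, finding a state is a single `lookup.get(area)` call, which returns None for numbers that are not listed.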
It's hard to tell your level of skill and what your course has taught you so far which is why it's difficult answering these kinds of questions, and also for the most part that's why you will get a negative response from people - they assume that you would have had the answer in your course materials already and will assume that you are being lazy.
I'm going to assume that you are at a basic level and that you already know how to solve the problem the brute force way (your if/else construct) and that you are genuinely interested in how to make your code better and not simply asking for a solution you can copy/paste.
Now, while your if/else idea will work, that is a procedural way of thinking. You are working with an object oriented language, so I would suggest to you to think about how you could use the principles of OO to make this work better. A good starting point would be to make a collection of state objects that contain all the parameters you need. You could then loop through your state collection and use their properties to find the matching one. You could create the state collection by reading from a file or database or even just hard coding it for the purposes of your assignment.
Your intuition that a long chain of if and else if statements might be unwieldy is sound. Not only is it tedious, but the search to find the correct interval is inefficient. However, something needs to be in charge of this tedium. A simple solution would be to use a Dictionary to store key/value pairs that you could build up once and reuse throughout the application. This has the downside of requiring more space than necessary, as every individual mapping becomes an element of the data structure. You could instead follow the advice from this question and use a data structure more suited to ranged values for lookups.
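One structure suited to ranged values is a sorted list of range upper bounds searched with a binary search, sketched here in Python. Only two ranges are included and their bounds are placeholders for illustration; the real bounds should be taken from the table linked in the question. `state_for_area` is a made-up name.

```python
from bisect import bisect_left

# (last area number in the range, state), sorted by the upper bound.
# Placeholder data: check the bounds against the linked table.
RANGE_ENDS = [(3, "New Hampshire"), (7, "Maine")]
UPPER_BOUNDS = [end for end, _ in RANGE_ENDS]


def state_for_area(area):
    """Return the state whose range contains area, or None if out of range."""
    if area < 1:
        return None
    # Index of the first range whose upper bound is >= area.
    i = bisect_left(UPPER_BOUNDS, area)
    return RANGE_ENDS[i][1] if i < len(RANGE_ENDS) else None
```

This stores one entry per range rather than one per number, and each lookup takes a logarithmic number of comparisons instead of walking the whole if/else chain.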
It's difficult to tell from your question as it's written what your level of expertise is, and going into any real detail here would essentially mean completing your assignment for you. It should be noted though, that there's nothing inherently wrong with your solution, and depending on where you are in your education it may be what's expected.
I'm planning on making a casual word game for WP7 using XNA. The game mechanics are fine enough for me to implement but it is just the checking to see if the word they make is actually a word or not.
I thought about having a text file and loading that into memory at the start, but surely this wouldn't be possible to keep in memory for a phone? Also how slow would it be to read from this to see if it is a word. How would they be stored in memory? Would it be best to use a dictionary/hashmap and each key is a word and i just check to see if that key exists? Or would it put them in an array?
Stuck on the best way to implement this, so any input is appreciated. Thanks
Depending on your phone's hardware, you could probably just load a text file into memory. The English language probably has only a couple hundred thousand words. Assuming your average word is around 5 characters, that's roughly a meg of data. You will have overhead managing that file in memory, but that's where the specifics of the hardware matter. BTW, it's not uncommon for the current generation of phones to have a gig of RAM.
Please see the following related SO questions which require a text file for a dictionary of words.
Dictionary text file
Putting a text file into memory, even of a whole dictionary, shouldn't be too bad as seth flowers has said. Choosing an appropriate data structure to hold the words will be important.
I would not recommend a dictionary using words as keys... that's kind of silly honestly. If you only have keys and no values, what good is a dictionary? However, you may be on a good track with the Dictionary idea. The first thing I would try would be a Dictionary<char, string[]>, where the key is the first letter, and the value is a list of all words beginning with that letter. Of course, that array will be very long, and search time on the array slow (though lookup on the key should be zippy, as char hashes are unique). The advantage is that, if you use the proper .txt dictionary file and load each word in order, you will know that list is ordered by alphabet. So, you can use efficient search techniques like binary search, or any number of searches formulated for pre-sorted lists. It may not be that slow in the end.
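The first-letter buckets with a binary search inside each bucket could be sketched as follows. This is Python rather than C#, and `build_buckets` and `is_word` are illustrative names; the sketch assumes the word list is already sorted, as the answer suggests a proper dictionary file would be.

```python
from bisect import bisect_left
from collections import defaultdict


def build_buckets(sorted_words):
    """Group an already-sorted word list by first letter, preserving order."""
    buckets = defaultdict(list)
    for word in sorted_words:
        if word:
            buckets[word[0]].append(word)
    return buckets


def is_word(buckets, candidate):
    """Binary-search the bucket for the candidate's first letter."""
    if not candidate:
        return False
    bucket = buckets.get(candidate[0], [])
    i = bisect_left(bucket, candidate)
    return i < len(bucket) and bucket[i] == candidate
```

Each lookup is one hash lookup on the first letter followed by a binary search over that letter's words only.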
If you want to go further, though, you can use the structure which underlies predictive text. It's called a Patricia Trie, or Radix Trie (Wikipedia). Starting with the first letter, you work your way through all possible branches until you either:
assemble the word the user entered, so it is a valid word
reach the end of the branch; this word does not exist.
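The walk described above can be sketched with a plain trie, shown here in Python. Note that this is the simple one-character-per-node form, not a compressed Patricia/radix trie, and the class and method names are made up for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next letter -> TrieNode
        self.is_word = False  # True if the path from the root spells a word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False  # ran off the end of the branch
        return node.is_word   # the letters exist, but do they end a word?
```

The `is_word` flag matters because a path can exist only as the prefix of a longer word: with "bear" stored, "bea" reaches a node but is not itself a word.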
'Tries' were made to address this sort of problem. I've never represented one in code, so I'm afraid I can't give you any pointers (ba dum tsh!), but there's likely a wealth of information on how to do it available on the internet. Using a Trie will likely be the most efficient solution, but if you find that an alphabet Dictionary like I mentioned above is sufficiently fast using binary search, you might just want to stick with that for now while you develop the actual gameplay. Getting bogged down with finding the best solution when just starting your game tends to bleed off your passion for getting it done. If you run into performance issues, then you make improvements-- at least that's my philosophy when designing games.
The nice thing is, since Windows Phone essentially supports only 2 different hardware specs, once you test the app and see it runs smoothly on them, you really don't have to worry about optimizing for any worse conditions. So use what works!
P.S.: on Windows Phone, loading text files is tricky. Here is a post on the issue which should help you.
I'm writing a desktop UI (.NET WinForms) to help a photographer clean up his image metadata. There is a list of 66k+ phrases. Can anyone suggest a good open source/free .NET component that uses some sort of algorithm to identify potential candidates for consolidation? For example, there may be two or more entries that are actually the same word or phrase and differ only by whitespace, punctuation, or a slight misspelling. The application will ultimately rely on the user to action the consolidation of phrases, but having an effective way to automatically find potential candidates will prove invaluable.
Let me introduce you to the Levenshtein distance formula. It is awesome:
http://en.wikipedia.org/wiki/Levenshtein_distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.
Personally I used this in a healthcare setting, where Provider names were checked for duplicates. Using the Levenshtein process, we gave them a confidence rating and allowed them to determine if it was a true duplicate or something unique.
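For reference, the distance itself is a short dynamic program. Below is a minimal Python sketch of the standard two-row formulation, plus one possible way to turn the distance into a confidence-style score; the `similarity` normalization (dividing by the longer length) is an assumption for illustration, not the formula used in the healthcare system mentioned above.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Score in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```

Pairs of phrases scoring above some threshold could be presented to the user as consolidation candidates. With 66k phrases an all-pairs comparison is expensive, so some pre-grouping (for example by normalized form) would likely be needed first.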
I know this is an old question, but I feel like this answer can help people who are dealing with the same issue in current time.
Please have a look at https://github.com/JakeBayer/FuzzySharp
It is a C# NuGet package with multiple methods, each implementing a particular kind of fuzzy search. I'm not sure, but perhaps the approach from Fosco's answer is also used in one of them.
Edit:
I just noticed a comment about this package, but I think it deserves a better place inside this question
I'm looking for a way to search through terabytes of data for patterns matching regexes. The implementation does need to support a lot of the finer capabilities of regexes, such as beginning and end of line data, full TR1 support (preferably with POSIX and/or PCRE support), and the like. We're effectively using this application to test policy regarding storage of potentially sensitive information.
I've looked into indexing solutions, but the majority of the commercial suites don't seem to have the finer regex capabilities we'd like (to date, they've all utterly failed at parsing the complex regexes we're using).
This is a complicated problem because of the sheer amount of data we have, and the limited system resources we can dedicate to the task of scanning (not much; it's just checks on policy compliance, so there isn't much of a budget for hardware).
I looked into Lucene but I'm a little hesitant about using index systems that aren't fully capable of dealing with our regex battery, and while searching through the entire dataset would remedy this problem, we'd have to let the servers chug along at performing these actions for a couple weeks at least.
Any suggestions?
PowerGREP can handle any regular expression and has been designed for exactly this purpose. I've found it to be extremely fast searching through large amounts of data, but I haven't tried it on the order of terabytes yet. But since there's a 30 day trial, it's worth a shot, I'd say.
It's especially powerful when it comes to searching specific parts of files. You can section the file according to your own criteria, and then apply another search only on those sections. Plus, it has very good reporting capabilities.
You might want to take a look at Apache Hadoop. Enormous sites like Yahoo and Facebook use Hadoop for a variety of things, one of them being processing multi-TB of text logs.
In the Hadoop documentation there is an example of a distributed Grep that could be scaled to handle any conceivable data set size.
There is also a SequenceFileInputFilter.RegexFilter in the Hadoop API if you wanted to roll your own solution.
I can only offer a high-level answer. Building on Tim's and shadit's answers, use a two-pass approach implemented as a MapReduce algorithm on EC2 or Azure Compute. In each pass the Map could take a chunk of data with an identifier and return to Reduce the identifier if a match is found, else a null value. Scale it as wide as you need to shrink the processing time.
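The shape of a single pass can be sketched in plain Python, without any Hadoop, EC2 or Azure API. `map_chunk` and `reduce_matches` are hypothetical names; in a real deployment the framework would distribute the chunks and collect the mapper output.

```python
import re


def map_chunk(chunk_id, text, patterns):
    """Map step: return the chunk identifier if any pattern matches, else None."""
    return chunk_id if any(p.search(text) for p in patterns) else None


def reduce_matches(mapped):
    """Reduce step: drop the nulls and return the matching identifiers."""
    return sorted(cid for cid in mapped if cid is not None)


# Sequential stand-in for what the cluster would run in parallel.
patterns = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]
chunks = {"file1:0": "nothing here", "file1:1": "id 123-45-6789 on record"}
matches = reduce_matches(map_chunk(cid, text, patterns)
                         for cid, text in chunks.items())
```

Because each map call depends only on its own chunk, the work spreads across as many machines as the budget allows.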
The grep program is highly optimized for regex searching in files, to the point where I would say you could not beat it with any general-purpose regex library. Even that would be impractically slow for searching terabytes, so I think you're out of luck on doing full regex searches.
One option might be to use an indexer as a first-pass to find likely matches, then extract some bytes on either side of each match and run a full regex match on it.
disclaimer: i am not a search expert.
if you really need all the generality of regexps then there's going to be nothing better than trawling through all the data (but see comments below on speeding that up).
however, i would guess that is not really the case. so the first thing to do is see if you can use an index to identify possible documents. if, for example, you know that all your matches will include a word (any word) then you can index the words, use that to find the (hopefully small) set of documents that include that word, and then use grep or equivalent only on those files.
so, for example, maybe you need to find documents that have "FoObAr" at the start of the line. you would start with a caseless index to identify files that have "foobar" anywhere, and then grep (only) those for "^FoObAr".
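a rough python sketch of that two-stage idea, with a toy in-memory word index standing in for a real indexer. `build_index` and `search` are made-up names, and the documents are held in a dict purely for illustration.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Caseless word index: lowercase word -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(docs, index, required_word, pattern):
    """Use the index to pick candidates, then run the full regex only on those."""
    rx = re.compile(pattern, re.MULTILINE)  # ^ matches at the start of each line
    candidates = index.get(required_word.lower(), set())
    return sorted(d for d in candidates if rx.search(docs[d]))
```

only the documents that contain "foobar" in any casing are ever read by the regex stage, which is where the savings come from.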
next, how to grep as quickly as possible. you're likely going to be limited by io speed. so look at using several disks (there may be no need to use raid - you could just have one thread per disk). also, consider compression. you don't need random access to these files, and if they are text (i assume they are if you are grepping them) then they will compress nicely. that will reduce the amount of data you need to read (and store).
finally, note that if your index doesn't work for ALL queries, then it's probably not worth using. you can "grep" for all expressions in a single pass, and the expensive process is reading the data, not the details of the grep, so even if there is "just one" query that cannot be indexed, and you therefore need to scan everything, then building and using an index is probably not a good use of your time.
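checking every expression during one read of the data could look like the python sketch below. `scan_once` is a hypothetical name; `lines` can be any iterable of text lines, for example a file opened with `gzip.open(path, "rt")` if the data is stored compressed.

```python
import re


def scan_once(lines, patterns):
    """Test every pattern against each line while reading the data only once.

    patterns -- dict mapping a label to a regex string
    returns  -- dict mapping each label to the 1-based line numbers that matched
    """
    compiled = {name: re.compile(rx) for name, rx in patterns.items()}
    hits = {name: [] for name in compiled}
    for lineno, line in enumerate(lines, 1):
        for name, rx in compiled.items():
            if rx.search(line):
                hits[name].append(lineno)
    return hits
```

the read cost is paid once no matter how many patterns are in the battery; only the per-line cpu work grows with the number of patterns.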