One of my clients wants to use a unique code for his items (long story..) and he asked me for a solution. The code will consist of 4 parts: the first is the zip code the item is sent from, the second is the supplier registration number, the third is the year the item is sent, and the last part is a three-character alphanumeric unique identifier.
As you can see, the first three parts are static fields which will never change for the same sender in the same year, so we can say that the last part is the identifier part for that year. This part is 3-character alphanumeric, which means it runs from 000 to ZZZ.
The problem is that my client, for understandable reasons, wants this part not to be sequential. For example, this is not what he wants:
06450-05-2012-000
06450-05-2012-001
06450-05-2012-002
...
06450-05-2012-ZZY
06450-05-2012-ZZZ
The last part should be produced randomly, like:
06450-05-2012-A17
06450-05-2012-0BF
06450-05-2012-002
...
06450-05-2012-T7W
06450-05-2012-22C
But it should also be non-repetitive. So once a possible id is generated, it should be discarded from the selection pool.
I am looking for an effective way to do this.
If I only record the selected possibilities and check a newly created one against them, there is always a worst-case possibility that it keeps producing already-selected ones, especially near the end.
If I create all possibilities at once and record them in a table or a file, it may take a while after every item creation because it has to look up a non-selected record. By the way, 26 letters + 10 digits means 46,656 possible combinations, and there is a chance that a 4th character may be added, which means 1,679,616 possible combinations.
Is there a more effective way you can suggest? I will use C# for coding and MS SQL for the database.
If it doesn't have to be random, you could simply choose a fixed but "unpredictable" addend which is relatively prime to 26 + 10 == 36 == 2²·3². In other words, just choose a fixed addend divisible by neither 2 nor 3.
Then keep adding this fixed number to your previous serial number every time you need a new serial number. This is to be done modulo 46656 (or 1679616) of course.
Mathematics guarantees you won't get the same number twice (before no more "free" numbers are left).
As the addend, you could use const int addend = 26075 since it's 5 modulo 6.
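For illustration, a minimal C# sketch of this idea (class and method names are mine; persisting the previous serial per zip/supplier/year tuple is left out):

static class SerialGenerator
{
    const int Base = 36;                   // 26 letters + 10 digits
    const int Space = Base * Base * Base;  // 46,656 three-character codes
    const int Addend = 26075;              // divisible by neither 2 nor 3

    // Next serial in the full-cycle sequence modulo the code space.
    public static int Next(int previousSerial) => (previousSerial + Addend) % Space;

    // Encode a serial as a 3-character code: 0 -> "000", 46655 -> "ZZZ".
    public static string Encode(int serial)
    {
        const string digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
        return new string(new[]
        {
            digits[serial / (Base * Base)],
            digits[serial / Base % Base],
            digits[serial % Base]
        });
    }
}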
If you expect to create far less than 36^3 entries for each zip-supplier-year tuple, you should probably just pick a random value for the last field and then check to see if it exists, repeating if it does.
Even if you create half of the maximum number of possible entries, new entries still have an expected value of only one failure. Assuming your database is indexed on the overall identifier, this isn't too great a price to pay.
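A minimal sketch of that approach; codeExists stands in for whatever indexed lookup you run against the database:

using System;

static class SuffixPicker
{
    const string Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // codeExists is a stand-in for a lookup against the indexed identifier column.
    public static string NewSuffix(Func<string, bool> codeExists, Random rng)
    {
        while (true)
        {
            var candidate = new string(new[]
            {
                Alphabet[rng.Next(Alphabet.Length)],
                Alphabet[rng.Next(Alphabet.Length)],
                Alphabet[rng.Next(Alphabet.Length)]
            });
            if (!codeExists(candidate))
                return candidate;   // retries stay rare while the space is mostly free
        }
    }
}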
That said, if you expect to use all but a few possible identifiers, then you should probably create all the possible records in advance. It may sound like a high cost, but each space storing an unused record will eventually store a real record.
I'd expect the first situation is more likely, but if not, or if there's some other combination of the two, please add a comment with some more information and I'll revise my answer.
I think the options depend on how many of the codes are going to be used:
If you expect to use most of them within a year, then it is better to pre-generate. If done right, lookups should be really fast. And you are going to have 1,679,616 items per year in your DB anyway, so you will have to get such things right.
On the other hand, is it good that you are expecting to use most of them? It may leave you without codes if there are suddenly more items than expected.
If you expect to use only a small amount, then random + existence check might be the way to go; however, it is unclear below what amount that approach is best (I am pretty sure it is possible to calculate that, though).
Related
I'm currently looking for a way to implement a partial word pattern algorithm in C#. The situation I'm in is as follows:
I got a textfield for the search pattern. Every time the user enters or deletes a char in this field, an event triggers which re-runs the search algorithm. So in case I want to search for the word "face" in strings like
"Facebook", "Facelifting", ""Faceless Face" (whatever that should be) or in generally ANY real life sentences as strings,
the algorithm would first start running when "f" is typed in the field. It then shows the most relevant string on top of the list the strings are in. The second time it runs when "fa" is typed, and the list is sorted again. This goes on until "face" is completely typed in the textfield and the list is sorted again.
However, I don't know what algorithm could be used. I tried the answer from Alain (Getting the closest string match), a simple Levenshtein distance algorithm, as well as a self-made algorithm which calculates the priority via
priority = (length_of_typed_pattern) * (amount_of_substr_matches)
In C#, the latter looks like this:
int count = Regex.Matches(title, Regex.Escape(pattern)).Count;
int priority = pattern.Length * count;
The pattern as well as the title are composed of only lowercase letters.
My conclusions so far:
Hamming distance won't make any sense since the strings are not the same length most of the time
The answer from Alain works fine, but only if at least one word completely matches (you only find a most relevant string/sentence when at least one word is equal to the pattern; so if you have typed "face" and there's a string containing the word "facebook", that string is almost never a top priority).
What other ideas could I try? The goal is to sort the list of strings as well as possible at the earliest possible moment (with the fewest letters typed).
You can look at my implementations in the search-* branches of my repository at http://github.com/croemheld/sprung, in Sprung/WindowMatcher.cs and Sprung/Window.cs.
Thanks for your help.
First of all, you need to store a frequency for each string (the number of times that particular string has been searched) somewhere, in order to show the most relevant one when searched. If you need to show, say, the k most relevant entries, a Min Heap of size k can be used.
Case 1 - A letter is pressed for the first time:
Step (a): Read all the strings from a database or dictionary and store them in some data structure (say DS1) together with a FLAG_VALID (set to 1 initially) which shows whether the string is valid for the present search characters (for the first letter, all strings are valid).
As you read the strings, fill the Min Heap according to their frequency; once the heap is full, an element is inserted only when its frequency is greater than the current minimum (i.e. the first element of the Min Heap).
Step (b) (this step is the same in every case for showing results): To show results you need to display the elements in the reverse order of the Min Heap, i.e. the first element in the Min Heap has the least priority, so basically delete the elements one by one and show them from last to first.
NOTE: The Min Heap holds a reference to each string, so the string and its frequency can be accessed at the same time.
Case 2 - The next letter is entered in the search box:
Step (a): Search through DS1, in which all strings are present, and check FLAG_VALID first. If the string is still valid, compare the string from the search box with the string from DS1, set the flag accordingly (1 for a match, 0 otherwise), and fill the k-sized Min Heap (which is empty after the last search) as in Case 1.
Step (b) is as usual.
Case 3 - A letter is deleted from the search box:
It is similar to the above cases, but this time we also need to check the strings whose FLAG_VALID is 0 (i.e. strings which are currently invalid).
This is a crude searching method and can be improved with better data structures and by tweaking the algorithm.
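A rough C# sketch of Cases 1 and 2 (simplified: it re-checks every entry instead of reusing the previous FLAG_VALID pass; the Entry type and TopK name are mine). It uses .NET 6's PriorityQueue as the k-sized Min Heap:

using System;
using System.Collections.Generic;

class Entry
{
    public string Text;
    public int Frequency;     // how often this string has been searched before
    public bool Valid = true; // FLAG_VALID
}

static class RelevantSearch
{
    // Returns the k most frequently searched strings that contain the pattern.
    public static List<Entry> TopK(IEnumerable<Entry> ds1, string pattern, int k)
    {
        var heap = new PriorityQueue<Entry, int>();   // min-heap keyed on frequency
        foreach (var entry in ds1)
        {
            entry.Valid = entry.Text.Contains(pattern, StringComparison.OrdinalIgnoreCase);
            if (!entry.Valid) continue;

            if (heap.Count < k)
                heap.Enqueue(entry, entry.Frequency);
            else if (entry.Frequency > heap.Peek().Frequency)
            {
                heap.Dequeue();                       // evict the current minimum
                heap.Enqueue(entry, entry.Frequency);
            }
        }
        // Step (b): pop the min-heap and reverse, so the most relevant comes first.
        var result = new List<Entry>();
        while (heap.Count > 0) result.Add(heap.Dequeue());
        result.Reverse();
        return result;
    }
}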
First off, let me explain what I'm trying to achieve. The application I'm making should be able to compare two columns of two different tables with each other, so that every cell of the column from the first table is linked to the best matching cell from the column of the second table. You would get something like this:
(source: modelbouwforum.nl)
This can easily be achieved by using Levenshtein's algorithm, so I wrote a test program in C# to see if I could recreate the results the image shows. I made two arrays, one containing the first column of the image and one containing the second. Every cell of the first column is compared to every cell of the second column, which means 4 iterations per cell (16 in total). The highest match (the one with the lowest Levenshtein distance) from the second column is then linked to the cell of the first column.
The problem:
Say we have two large columns with 100K rows each; this will cause serious performance issues, because every cell from the first column needs to be matched against every cell of the second column to get the best possible match, so you have to iterate 100K * 100K = 10 billion times. I have to come up with something that avoids iterating 10 billion times.
I did some research on where Levenshtein can be used and came across this: http://www.slideshare.net/fullscreen/VasileTopac/fuzzy-hash-map/4. I'm wondering whether I could create something like what the author did in that link.
Some things to consider:
In such large columns there could be multiple matches on a single cell (the user needs to choose the right one). That means you can't exclude previously matched cells from the current search in order to bring the iteration count down.
In the example the matching/comparison is only done on two columns; however, in the future I'd like to compare a single column from table 1 to all the columns from table 2 (less work for the user). This will be even more time-expensive, as you can expect.
NOTE:
I've only been using C# for 4 months now; I hope someone can give me a good starting point (I'd prefer not to get a fully working answer, as I'd rather do some research myself and learn from it). Thanks for understanding. English is not my native language, so please feel free to edit my post.
Try to come up with an assumption about the matching that always holds true and that can segment it into smaller chunks, like:
The first capital alpha character in table 1 must match the first capital alpha character in table 2
You may be able to find some valid assumption that will allow you to pre-process the values into another column:
FirstAlpha1 FirstAlpha2
=========== ===========
P C
S F
C P
F S
Then you could do a simple sort and join (exact match) on this extra value to divide the solution into smaller chunks.
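A small C# sketch of that pre-processing idea (FirstAlphaKey is an assumed example key; any cheap, stable property of the value would do):

using System.Collections.Generic;
using System.Linq;

static class Prematcher
{
    // Assumed key: the first upper-case letter of the value ('\0' if none).
    static char FirstAlphaKey(string value) => value.FirstOrDefault(char.IsUpper);

    // Yields only the pairs that share a key; Levenshtein is then run on these
    // much smaller chunks instead of on the full 100K x 100K cross product.
    public static IEnumerable<(string Left, string Right)> CandidatePairs(
        IEnumerable<string> column1, IEnumerable<string> column2)
    {
        var buckets = column2.GroupBy(FirstAlphaKey)
                             .ToDictionary(g => g.Key, g => g.ToList());
        foreach (var left in column1)
            if (buckets.TryGetValue(FirstAlphaKey(left), out var bucket))
                foreach (var right in bucket)
                    yield return (left, right);
    }
}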
I have this peculiar piece of code that is bothering me,
// exbPtr points to 128-bit unsigned integer
// lgID is a "short" with 0xFFFF being the max value
int hash = (*exbPtr + (int)lgID * 9) & tlpLengthMask;
Initially this "hash table", which is really an array is initialized to 256 elements, and tlpLengthMask is set to 255.
Then there is this mysterious code, with a comment right above it saying "if we reached here, there has been a collision". It then loops back, so it looks like this is hash collision handling and re-hashing?
hash = (hash + (int)lgID * 2 + 1) & tlpLengthMask;
In addition, there is a ton of debug code which says that the length of this array should be a power of 2, because we're using the mask as a modulus.
Can someone explain what the authors intent was? What is the reasoning behind this?
EDIT -- what I'm trying to discern is why he multiplied by 9, and then why multiply by 2 to re-hash.
There are three possibilities:
1) The original author just constructed the hashing functions more or less randomly, saw that they worked well enough, and left it at that.
2) The original author had test data that well represented the actual data and saw that these functions worked extremely well for his exact application.
3) This code is performing very poorly and his hash table is not operating efficiently at all.
The only real requirement is that the output look evenly distributed over the hash table for whatever input he actually encounters and always produce the same output for the same input. While these kinds of functions generally perform poorly, they may be good enough for this specific application.
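For what it's worth, here is my reading of the probe sequence those two lines imply (a sketch, not the original author's code). Because the step lgID * 2 + 1 is always odd, it is coprime to the power-of-two table length, so the probe visits every slot once before it repeats:

// table[i] == null means the slot is empty; the entries hold the keys.
static int FindSlot(long key, int lgID, long?[] table, int tlpLengthMask)
{
    int hash = (int)((key + lgID * 9) & tlpLengthMask);
    int step = lgID * 2 + 1;
    for (int probes = 0; probes <= tlpLengthMask; probes++)
    {
        if (table[hash] == null || table[hash] == key)
            return hash;                          // empty slot or the key itself
        hash = (hash + step) & tlpLengthMask;     // collision: keep probing
    }
    return -1;                                    // table is full
}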
By the way, this type of open hashing doesn't work in the face of deletions. For example, say you add one record to the table. Then you go to add a second, but it collides with the first, so you skip forward to add the second. Everything's fine now -- you can find both the first record (directly) and the second record (by skipping over the first when you find it at the second record's hash location).
But if you delete the first record, how do you find the second? When you look at the second record's hash location, you find nothing. Do you try skipping? If so, how many times?
There are workarounds to these problems, but they tend to be very easy to do incorrectly.
I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is checking a user input against a certain column in a table for all these records and return any matches with a certain hamming distance (again, this question's not about determining hamming distance, etc.). The purpose, of course, is to create a "did you mean" feature, where a user searches a name, and if no direct matches are found in the database, a list of possible matches are returned.
I'm trying to come up with a way to do all of these checks in the most reasonable runtime possible. How can I check a user's input against all of these records in the most efficient way possible?
The feature is currently implemented, but the runtime is exceedingly slow. The way it works now is it loads all records from a user-specified table (or tables) into memory and then performs the check.
For what it's worth, I'm using NHibernate for data access.
I would appreciate any feedback on how I can do this or what my options are.
Calculating Levenshtein distance doesn't have to be as costly as you might think. The code in the Norvig article can be thought of as pseudocode to help the reader understand the algorithm. A much more efficient implementation (in my case, approx 300 times faster on a 20,000 term data set) is to walk a trie. The performance difference is mostly attributed to removing the need to allocate millions of strings in order to do dictionary lookups, spending much less time in the GC, and better locality of reference, so you have fewer CPU cache misses. With this approach I am able to do lookups in around 2ms on my web server. An added bonus is the ability to easily return all results that start with the provided string.
The downside is that creating the trie is slow (can take a second or so), so if the source data changes regularly then you need to decide whether to rebuild the whole thing or apply deltas. At any rate, you want to reuse the structure as much as possible once it's built.
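A minimal sketch of the trie walk (not the production code; names are illustrative): each node computes one row of the Levenshtein matrix, so shared prefixes are processed only once and whole subtrees are pruned when every cell in a row exceeds the allowed distance.

using System;
using System.Collections.Generic;
using System.Linq;

class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public string Word;   // set only on nodes that end a dictionary word
}

class Trie
{
    private readonly TrieNode root = new TrieNode();

    public void Insert(string word)
    {
        var node = root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.Word = word;
    }

    // Returns every stored word within maxDistance edits of the query.
    public List<string> Search(string query, int maxDistance)
    {
        var results = new List<string>();
        var firstRow = Enumerable.Range(0, query.Length + 1).ToArray();
        foreach (var kv in root.Children)
            SearchRecursive(kv.Value, kv.Key, query, firstRow, maxDistance, results);
        return results;
    }

    private static void SearchRecursive(TrieNode node, char ch, string query,
                                        int[] prevRow, int maxDistance, List<string> results)
    {
        int cols = query.Length + 1;
        var row = new int[cols];
        row[0] = prevRow[0] + 1;
        for (int col = 1; col < cols; col++)
        {
            int insertCost = row[col - 1] + 1;
            int deleteCost = prevRow[col] + 1;
            int replaceCost = prevRow[col - 1] + (query[col - 1] == ch ? 0 : 1);
            row[col] = Math.Min(Math.Min(insertCost, deleteCost), replaceCost);
        }

        if (node.Word != null && row[cols - 1] <= maxDistance)
            results.Add(node.Word);

        // Prune: only descend if some cell is still within the edit budget.
        if (row.Min() <= maxDistance)
            foreach (var kv in node.Children)
                SearchRecursive(kv.Value, kv.Key, query, row, maxDistance, results);
    }
}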
As Darcara said, a BK-Tree is a good first take. They are very easy to implement. There are several free implementations easily found via Google, but a better introduction to the algorithm can be found here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.
Unfortunately, calculating the Levenshtein distance is pretty costly, and you'll be doing it a lot if you're using a BK-Tree with a large dictionary. For better performance, you might consider Levenshtein Automata. A bit harder to implement, but also more efficient, and they can be used to solve your problem. The same awesome blogger has the details: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata. This paper might also be interesting: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652.
I guess the Levenshtein distance is more useful here than the Hamming distance.
Let's take an example: We take the word example and restrict ourselves to a Levenshtein distance of 1. Then we can enumerate all possible misspellings that exist:
1 insertion (208)
aexample
bexample
cexample
...
examplex
exampley
examplez
1 deletion (7)
xample
eample
exmple
...
exampl
1 substitution (182)
axample
bxample
cxample
...
examplz
You could store each misspelling in the database, and link that to the correct spelling, example. That works and would be quite fast, but creates a huge database.
Notice how most misspellings occur by doing the same operation with a different character:
1 insertion (8)
?example
e?xample
ex?ample
exa?mple
exam?ple
examp?le
exampl?e
example?
1 deletion (7)
xample
eample
exmple
exaple
examle
exampe
exampl
1 substitution (7)
?xample
e?ample
ex?mple
exa?ple
exam?le
examp?e
exampl?
That looks quite manageable. You could generate all these "hints" for each word and store them in the database. When the user enters a word, generate all "hints" from that and query the database.
Example: User enters exaple (notice missing m).
SELECT DISTINCT word
FROM dictionary
WHERE hint = '?exaple'
OR hint = 'e?xaple'
OR hint = 'ex?aple'
OR hint = 'exa?ple'
OR hint = 'exap?le'
OR hint = 'exapl?e'
OR hint = 'exaple?'
OR hint = 'xaple'
OR hint = 'eaple'
OR hint = 'exple'
OR hint = 'exale'
OR hint = 'exape'
OR hint = 'exapl'
OR hint = '?xaple'
OR hint = 'e?aple'
OR hint = 'ex?ple'
OR hint = 'exa?le'
OR hint = 'exap?e'
OR hint = 'exapl?'
exaple with 1 insertion == exa?ple == example with 1 substitution
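Generating these hints in C# is straightforward; a sketch (the '?' wildcard convention follows the example above, the method name is mine):

using System.Collections.Generic;

static class Hints
{
    // All wildcard "hints" at edit distance 1: deletions, '?' substitutions,
    // and '?' insertions, exactly as enumerated above.
    public static IEnumerable<string> Generate(string word)
    {
        var hints = new HashSet<string>();
        for (int i = 0; i < word.Length; i++)
        {
            hints.Add(word.Remove(i, 1));                 // 1 deletion
            hints.Add(word.Remove(i, 1).Insert(i, "?"));  // 1 substitution
        }
        for (int i = 0; i <= word.Length; i++)
            hints.Add(word.Insert(i, "?"));               // 1 insertion
        return hints;
    }
}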
See also: How does the Google “Did you mean?” Algorithm work?
it loads all records from a user-specified table (or tables) into memory and then performs the check
don't do that
Either do the match on the back end and only return the results you need, or cache the records into memory early on, take the working-set hit, and do the check when you need it.
You will need to structure your data differently than a database can. Build a custom search tree, with all dictionary data needed, on the client. Although memory might become a problem if the dictionary is extremely big, the search itself will be very fast. O(nlogn) if I recall correctly.
Have a look at BK-Trees
Also, instead of using the Hamming distance, consider the Levenshtein distance
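To get you started, a minimal BK-tree sketch (class and method names are mine; the Levenshtein implementation is the plain two-row DP):

using System;
using System.Collections.Generic;

class BkTree
{
    private class Node
    {
        public readonly string Word;
        public readonly Dictionary<int, Node> Children = new Dictionary<int, Node>();
        public Node(string word) { Word = word; }
    }

    private Node root;

    public void Add(string word)
    {
        if (root == null) { root = new Node(word); return; }
        var node = root;
        while (true)
        {
            int d = Levenshtein(node.Word, word);
            if (d == 0) return;                    // already in the tree
            if (!node.Children.TryGetValue(d, out var child))
            {
                node.Children[d] = new Node(word);
                return;
            }
            node = child;
        }
    }

    // All words within 'tolerance' edits of the query.
    public List<string> Search(string query, int tolerance)
    {
        var results = new List<string>();
        if (root == null) return results;
        var stack = new Stack<Node>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            var node = stack.Pop();
            int d = Levenshtein(node.Word, query);
            if (d <= tolerance) results.Add(node.Word);
            // Triangle inequality: only subtrees whose edge label lies in
            // [d - tolerance, d + tolerance] can contain matches.
            for (int i = d - tolerance; i <= d + tolerance; i++)
                if (node.Children.TryGetValue(i, out var child))
                    stack.Push(child);
        }
        return results;
    }

    private static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
            (prev, curr) = (curr, prev);
        }
        return prev[b.Length];
    }
}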
The answer you marked as correct..
Note: when I say "dictionary" in this post, I mean a hash map (basically a Python dictionary).
Another way you can improve its performance is by creating an inverted index of words.
So rather than calculating the edit distance against the whole DB, you create 26 dictionaries, each with a letter of the alphabet as its key; English has 26 letters, so the keys are "a", "b", ..., "z".
So assume you have the word "apple" in your DB.
In the "a" dictionary you add the word "apple",
in the "p" dictionary you add the word "apple",
in the "l" dictionary you add the word "apple",
in the "e" dictionary you add the word "apple".
Do this for all the words in the DB.
Now when a misspelled word is entered,
let's say "aplse",
you start with "a" and retrieve all the words in "a",
then you take "p" and find the intersection of the words in "a" and "p",
then you take "l" and find the intersection of the words in "a", "p" and "l",
and you do this for all the letters.
In the end you will have just the small set of words made up of the letters "a", "p", "l", "s", "e".
In the next step, you calculate the edit distance between the input word and the words returned by the above steps, thus drastically reducing your run time.
Now there might be a case when nothing is returned,
something like "aklse" - there is a good chance that there is no word made up of just these letters.
In this case, you will have to start reversing the above steps until you have a manageable number of words left.
So, for example, start with *klse (the intersection of the words containing k, l, s, e), giving num(wordsReturned) = k1,
then a*lse (the intersection of the words containing a, l, s, e), giving numWords = k2,
and so on.
Choose the one which returns the higher number of words. In this case there is really no single answer, as a lot of words might have the same edit distance; you can just say that if the edit distance is greater than some "k" then there is no good match.
There are many sophisticated algorithms built on top of this,
like using statistical inference after these steps (the probability that the word is "apple" when the input is "aplse", and so on). Then you go the machine learning way :)
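A C# sketch of the basic inverted-index lookup described above (BuildIndex and CandidatesFor are my names; the final edit-distance ranking is left out):

using System.Collections.Generic;
using System.Linq;

static class InvertedIndex
{
    // One bucket per letter, holding every word that contains that letter.
    public static Dictionary<char, HashSet<string>> BuildIndex(IEnumerable<string> words)
    {
        var index = new Dictionary<char, HashSet<string>>();
        foreach (var word in words)
            foreach (var c in word.Distinct())
            {
                if (!index.TryGetValue(c, out var bucket))
                    index[c] = bucket = new HashSet<string>();
                bucket.Add(word);
            }
        return index;
    }

    // Intersects the buckets for every letter of the (possibly misspelled) input;
    // edit distance is then computed only against this much smaller candidate set.
    public static HashSet<string> CandidatesFor(string input,
                                                Dictionary<char, HashSet<string>> index)
    {
        HashSet<string> candidates = null;
        foreach (var c in input.Distinct())
        {
            if (!index.TryGetValue(c, out var bucket))
                return new HashSet<string>();         // no word contains this letter
            candidates = candidates == null
                ? new HashSet<string>(bucket)
                : new HashSet<string>(candidates.Intersect(bucket));
        }
        return candidates ?? new HashSet<string>();
    }
}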
What kind of algorithm is this? I know pretty much nothing, but this is what I'm trying to do in code... I have a class 'Item' with properties int A and int B. I have multiple List<Item> lists, each with a random number of Items, inconsistent with any other list. I must choose 1 item from each list to get the highest possible sum of Item.A, while ensuring that the sum of Item.B is at least a certain number. In the future there might also be another property Item.C to conform to, whose sum must be equal to a certain number. I have no idea how to write this :(
So to put it this way;
class Item
{
    public int A;
    public int B;
    public int C;
}
I have 10 different List<Item> lists, each with a random number of Items inside.
We must find exactly the best combination, which has:
a) Highest sum of Item.A
b) Constraint that the sum of Item.B must be higher than X
c) Constraint that the sum of Item.C must be equal to X
I have no idea how to code this to be fast and efficient. :(
As mentioned in my comment, this is a Binary Programming problem, which can be cast as a multi-dimensional knapsack problem. I would first try to solve it with an off-the-shelf Mixed Integer Programming (MIP) solver like the one suggested by Lieven in one of his comments (lpSolve), given that you "only" have some 100-200 binary variables. You might have to play around a little with the parameters. Some MIP solvers allow you to add search heuristics, which might be helpful. Given your constraints, I must admit I don't have a feel for how long a standard MIP solver will take, but I wouldn't hold my breath.
If a mixed-integer programming solver is not fast enough for you, you want to look at some more specialised algorithms. For your problem, the ones presented in Knapsack Problems, chapter 11.10 on the multiple-choice Knapsack problem (almost exactly your problem) and chapter 9 are relevant.
Edit: based on your comments, the good news is that your data ranges are pretty good and the problem seems solvable in a reasonable time. This paper (DOI in case the link vanishes) presents an algorithm that according to the authors solves problems of your size within seconds (see section 4.4 and 5.1). The bad news is that it contains a lot of math...
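If your B and C sums stay within small bounds, even a naive dynamic program over (B-sum capped at the minimum, exact C-sum) may already be fast enough. A sketch under that assumption (not the paper's algorithm; names are mine):

using System;
using System.Collections.Generic;

class Item { public int A, B, C; }

static class MultipleChoice
{
    // Picks exactly one item per list, maximising sum(A) subject to
    // sum(B) >= minB and sum(C) == targetC. Assumes B and C are non-negative
    // and targetC is small enough for a table. Returns null if infeasible.
    public static int? Solve(List<List<Item>> lists, int minB, int targetC)
    {
        // dp[b, c]: best A-sum with B-sum capped at b (b == minB means ">= minB")
        // and C-sum exactly c, using the lists processed so far.
        var dp = new int?[minB + 1, targetC + 1];
        dp[0, 0] = 0;

        foreach (var list in lists)
        {
            var next = new int?[minB + 1, targetC + 1];
            for (int b = 0; b <= minB; b++)
                for (int c = 0; c <= targetC; c++)
                {
                    if (dp[b, c] == null) continue;
                    foreach (var item in list)
                    {
                        int nb = Math.Min(minB, b + item.B);  // cap the B-sum
                        int nc = c + item.C;
                        if (nc > targetC) continue;           // C-sum only grows
                        int value = dp[b, c].Value + item.A;
                        if (next[nb, nc] == null || value > next[nb, nc])
                            next[nb, nc] = value;
                    }
                }
            dp = next;
        }
        return dp[minB, targetC];
    }
}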
I posted this question as an unregistered user and after clicking register, it didn't associate my unregistered user with my registered user, nice =/
In regards to the comment by van:
Typically there will be about 14 lists or so
Within each list there will be usually around 5-15 'Items'
Each item has those 3 properties.
We must choose exactly 1 item from each list.
We are looking for the maximum sum of PropertyA across the chosen items, one from each list.
The constraints are PropertyB and PropertyC, which the chosen combination must conform to, once again using the sums of those values across the combination.
It must also be the most optimal solution, not an approximation.