I have a list of addresses in two separate tables that are slightly off that I need to be able to match. For example, the same address can be entered in multiple ways:
110 Test St
110 Test St.
110 Test Street
Although this example is simple, you can imagine the situation in more complex scenarios. I am trying to develop a simple algorithm that will be able to match the above addresses as a key.
For example, the key might be "11TEST" - first two of 110, first two of Test and first two of the street variant. A full match key would also include the first 5 of the zip code, so in the above example the full key might look like "11TEST44680".
I am looking for ideas for an effective algorithm or resources I can look at for considerations when developing this. Any ideas can be pseudo code or in your language of choice.
We are only concerned with US addresses. In fact, we are only looking at addresses from 250 zip codes in Ohio and Michigan. We also do not have access to any postal software, although we would be open to ideas for cost-effective solutions (it would essentially be a one-time use). Please be mindful that this is an initial dump of data from a government source, so suggestions of how users can clean it are helpful as I build out the application, but I would love to have the best initial match I possibly can by being able to match addresses as well as possible.
I'm working on a similar algorithm as we speak, it should handle addresses in Canada, USA, Mexico and the UK by the time I'm done. The problem I'm facing is that they're in our database in a 3 field plaintext format [whoever thought that was a good idea should be shot IMHO], so trying to handle rural routes, general deliveries, large volume receivers, multiple countries, province vs. state vs. county, postal codes vs. zip codes, spelling mistakes is no small or simple task.
Spelling mistakes alone were no small feat, especially when you get to countries that use French names. Matching Saint, Sainte, St, Ste, Saints, Saintes, Sts, Stes, Grand, Grande, Grands, Grandes, with or without a period or hyphenation, to the larger part of a name causes no end of performance issues, especially when St could mean saint or street and may or may not have been entered in the correct context (i.e. feminine vs. masculine). What if the address has largely been entered correctly but has an incorrect province or postal code?
One place to start your search is the Levenshtein distance algorithm, which I've found to be really useful for eliminating a large portion of spelling mistakes. After that, it's mostly a case of searching for keywords and comparing against a postal database.
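As a rough illustration of the edit-distance idea mentioned above, here is a minimal Python sketch of the standard dynamic-programming algorithm. The function name `levenshtein` and the sample strings are only illustrative; nothing here is taken from the poster's own code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("Saint", "Sainte"))  # 1
print(levenshtein("110 test street", "110 tset street"))  # 2
```

A small distance relative to the length of the strings suggests a typo rather than a different street, but the cut-off is something you would have to tune against your own data.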
I would be really interested in collaborating with anyone that is currently developing tools to do this, perhaps we can assist each other to a common solution. I'm already part of the way there and have overcome all the issues I've mentioned so far, having someone else working on the same problem would be really helpful to bounce ideas off.
Cheers -
[ben at afsinc dot ca]
If you would prefer not to develop one and would rather use an off-the-shelf product that uses many of the technologies mentioned here, see: http://www.melissadata.com/dqt/matchup-api.htm
Disclaimer: I had a role in its development and work for the company.
In the UK we would use:
House Name or Number (where name includes Flat number for apartment blocks)
Postcode
You should certainly be using the postcode, but in the US I believe your Zip codes cover very wide areas compared to postcodes in the UK. You would therefore need to use the street and city.
Your example wouldn't differentiate between 11 Test Street, 110 - 119 Test Street, etc.
If your company has access to an address lookup system, I would run all the data through that to get the data back in a consistent format, possibly with address keys that can be used for matching.
If I was to take a crack at this I'd convert each address string into a tree using a pre-defined order of operations.
Eg. 110 Test Street Apt 3. Anywhere California 90210 =>
Get the type of address. E.g. street addresses have different formats than rural route addresses, and this differs by country.
Given that this is a street address, get the string that represents the type of street and convert that to an enum (eBoulevard, eRoad, etc..)
Given that this is a street address, pull out the street name (store in lower case)
Given that this is a street address, pull out the street number
Given that this is a street address, look for any apartment number (could be before the street number with a dash, could be after "Apt.", etc...)
          eStreet          // 1. an enum of possible address types, e.g. eStreet, eRuralRoute, ...
             |
          eStreet          // 2. an enum of street types, e.g. eStreet, eBlvd, eWay, ...
        /    |    \
    Name  Number   Apt
      |      |      |
    test    110     3
Eg. RR#3 Anywhere California 90210 =>
Get the type of address: rural route
Given that this is a rural route address, get the route number
    eRuralRoute
         |
         3
You'll need to do something similar for country state and zip information.
Then compare the resulting trees.
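To make the idea a little more concrete, here is a very small Python sketch that parses only the street line (not city, state or zip) into a flat record standing in for the tree, and then compares records for equality. The `STREET_TYPES` table, the regular expressions and the `ParsedAddress` type are all assumptions made for illustration; a real parser would need far more cases than this.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical, incomplete lookup table of street-type spellings.
STREET_TYPES = {"street": "STREET", "st": "STREET", "road": "ROAD", "rd": "ROAD",
                "boulevard": "BLVD", "blvd": "BLVD", "avenue": "AVE", "ave": "AVE"}


@dataclass(frozen=True)
class ParsedAddress:
    kind: str                          # "STREET" or "RURAL_ROUTE"
    street_type: Optional[str] = None  # normalised street type
    name: Optional[str] = None         # street name, lower case
    number: Optional[str] = None       # street number or route number
    apt: Optional[str] = None          # apartment, if any


RURAL = re.compile(r"^\s*r\.?\s*r\.?\s*#?\s*(\d+)\b", re.IGNORECASE)
APT = re.compile(r"\b(?:apt|apartment|unit)\.?\s*#?\s*(\w+)\s*$", re.IGNORECASE)
STREET = re.compile(r"^\s*(\d+)\s+(.+?)\s+([A-Za-z]+)\.?\s*$")


def parse_address_line(line: str) -> Optional[ParsedAddress]:
    """Return a comparable record, or None if the line is not understood."""
    rural = RURAL.match(line)
    if rural:  # step 1: decide the type of address
        return ParsedAddress(kind="RURAL_ROUTE", number=rural.group(1))
    apt = None
    found = APT.search(line)
    if found:  # peel a trailing apartment off the end of the line
        apt = found.group(1).lower()
        line = line[:found.start()]
    street = STREET.match(line)
    if street is None:
        return None
    number, name, suffix = street.groups()
    street_type = STREET_TYPES.get(suffix.lower())
    if street_type is None:
        return None
    return ParsedAddress("STREET", street_type, name.lower(), number, apt)


a = parse_address_line("110 Test Street Apt 3")
b = parse_address_line("110 test st. apt. 3")
print(a == b)  # True
```

Because the dataclass is frozen, the records can also be used as dictionary keys, which makes grouping candidate duplicates straightforward.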
This makes the comparison very simple; however, the code to generate the trees is very tricky. You'd want to test the crap out of it on thousands and thousands of addresses. Your problem is simpler if it is only US addresses you care about; British addresses, as already mentioned, are quite different, and Canadian addresses may have French in them (e.g. Place d'Armes, Rue Laurent, etc.).
If it is cost-effective for your company to write its own address normalization tool then I'd suggest starting with the USPS address standard. Alternatively, there are any number of vendors offering server side tools and web services to normalize, correct and verify addresses.
My company uses AccuMail Gold for this purpose because it does a lot more than just standardize & correct the address. When we considered the cost of even one week's worth of salary to develop a tool in-house the choice to buy an off-the-shelf product was obvious.
If you don't choose to use an existing system, one idea is to do the following:
Extract numbers from the address line
replace common street words with blanks
create match string
ie: "555 Canal Street":
Extract number gives "555" + "Canal Street"
Replace street words gives "555" + "Canal"
Create match string gives "555Canal"
"Canal st 555" would give the same match string.
By street words i mean words and abbreviations for "street" in your language, for example "st", "st.", "blv", "ave", "avenue", etc etc all are removed from the string.
By extracting numbers and separating them from the string it does not matter if they are first or last.
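The steps above could be sketched in Python roughly as follows. The `STREET_WORDS` set is an assumed, deliberately incomplete list, and the sketch upper-cases the words and appends the first five characters of the zip code (as the question suggested), which goes slightly beyond the "555Canal" example.

```python
import re

# Assumed list of street words to discard; extend it for your own data.
STREET_WORDS = {"st", "street", "ave", "avenue", "blv", "blvd", "boulevard",
                "rd", "road", "dr", "drive"}


def match_key(address_line: str, zipcode: str = "") -> str:
    """Build a match string: numbers first, then the remaining words."""
    tokens = re.findall(r"[A-Za-z0-9]+", address_line.lower())
    numbers = [t for t in tokens if t.isdigit()]
    words = [t for t in tokens if not t.isdigit() and t not in STREET_WORDS]
    return "".join(numbers) + "".join(words).upper() + zipcode[:5]


print(match_key("555 Canal Street"))       # 555CANAL
print(match_key("Canal st 555"))           # 555CANAL
print(match_key("110 Test St.", "44680"))  # 110TEST44680
```

One caveat: a blanket removal of "st" also strips it from names where it means "Saint", so street names such as "St Clair" lose part of their key.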
use an identity for the primary key, this will always be unique and will make it easier to merge duplicates later.
force proper data entry with the user interface. Make them enter each component in its own text box. The house number is entered in own box, the street name in its own box, city in own box, state from select list, etc.. This will make looking for matches easier
have a two-step "save" process
after initial save, do a search to look up matches, present them with list of possible matches as well as the new one.
after they select the new one save it, if they pick an existing one use that ID
clean the data. Try to strip out "street", "st", "drive", etc and store it as a StreetType char(1) that uses a FK to a table containing the proper abbreviations, so you can build the street.
look into SOUNDEX and DIFFERENCE
I have worked at large companies that maintain mailing lists, and they did not attempt to do it automatically; they used people to filter out the new from the dups because it is so hard to do. Plan for a merge feature so you can manually merge duplicates when they occur, and ripple the values through the PKs.
You might look into the Google Maps API and see if you can pass in your address and get a match back. I'm not familiar with it, this is just speculation.
I am working on a project where the users have to put in the physical address of their organization; in many cases users will put in a PO Box rather than their physical address. I need a way in C# to determine whether or not a user put in a P.O. Box or PO Box (or any other variation of this) rather than a "29 Maple Street" style address. I have had a few thoughts, but I thought I would get some really great feedback here.
Thanks
I would try to parse the address as a string. Then find a 'P.O. Box' or 'PO Box' in the array: if it finds it, the PO BOX should be the next element(s).
You will also need a way to detect the city so you know when to stop. You could use GeoNames (http://www.geonames.org/) as a database.
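For the simpler question of whether the text contains a PO Box at all, a regular expression may be enough. Below is a minimal Python sketch; the pattern and the name `looks_like_po_box` are assumptions rather than an exhaustive rule, and the same pattern should carry over to a C# `Regex` with little change.

```python
import re

# Matches "PO Box", "P.O. Box", "P O Box", "POBox", "Post Office Box", ...
PO_BOX = re.compile(r"\b(?:p\.?\s*o\.?|post\s+office)\s*box(?![a-z])", re.IGNORECASE)


def looks_like_po_box(address: str) -> bool:
    """True if the address text appears to mention a PO Box."""
    return PO_BOX.search(address) is not None


for sample in ("P.O. Box 123", "po box123", "Post Office Box 9", "29 Maple Street"):
    print(sample, "->", looks_like_po_box(sample))
```

Expect to extend the pattern as you see real input; users are inventive about how they write "PO Box".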
It was tough coming up with an appropriate title for my question. First a little background information in case you need it.
I have a bill that I am trying to read information off of using regexes. I save the information I need into 4 different tables: Account, Utility, Location, and Taxes.
The logic being that each bill has only one account number (account level). Each account number can pertain to multiple utilities (utility level). Each utility can have multiple locations (assume only 1 location for this question), and each location can have more than one Tax.
So for the bill found HERE
We can see that 4 Taxes (City Sales Tax of 2.97, County Sales Tax of 1.46, State Sales Tax of 3.44, and PPRTA Tax of 1.10) all belong to The 'Electric' Utility. We also see that 4 utilities (Electric, Gas, Water and Wastewater) belong to 1 Account Number, each with their own taxes.
Previously I have been doing something simple like this to capture all of the taxes in one capture group, multiple times: Tax:.* \$(.*)
What I am trying to accomplish now is to build a regex that Finds all of the taxes only for a given utility. Again, it must be in one capture group with multiple matches.
Here is an example of what I have so far for the Electric taxes: (?:Electric Commercial Service(?:.\n)?.?Tax:.* \$(.)(?:.\n)?.?Total charge this service)*
As you can see, this only picks up the first tax. I can not figure out a way to make it catch every tax between the words "Electric Commercial Service" and the "Total charge this service" pertaining to Electric service.
Thanks!
You can't do it in a single regex in most languages. A capture group will only result in one element in the match array, even if the group is wildcarded.
You need to do it in two steps. First use a regexp (or other means) to extract the portion of the bill for a single utility. Then within that string, you can use the regex
Tax:.* \$([\d.]+)$
to find all the taxes. In PHP, you'd use preg_match_all to find all the matches of this; other languages should have something comparable (maybe involving the g modifier to the regex).
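Here is a rough Python sketch of the two-step approach, using `re.findall` in place of `preg_match_all`. The bill text is made up for illustration (only the Electric tax names and amounts come from the question, and the Gas line is invented), so the section markers will probably need adjusting to the real layout.

```python
import re

# Made-up bill excerpt; the real bill layout is not shown in the question.
BILL = """\
Electric Commercial Service
City Sales Tax: $2.97
County Sales Tax: $1.46
State Sales Tax: $3.44
PPRTA Tax: $1.10
Total charge this service
Gas Commercial Service
City Sales Tax: $0.50
Total charge this service
"""


def taxes_for(utility: str, bill: str) -> list:
    """Step 1: cut out one utility's section. Step 2: find every tax in it."""
    section = re.search(
        re.escape(utility) + r" Commercial Service(.*?)Total charge this service",
        bill,
        re.DOTALL,  # let "." cross line breaks while isolating the section
    )
    if section is None:
        return []
    # MULTILINE makes "$" match at the end of each line, not just the text.
    return re.findall(r"Tax:.* \$([\d.]+)$", section.group(1), re.MULTILINE)


print(taxes_for("Electric", BILL))  # ['2.97', '1.46', '3.44', '1.10']
```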
It can be done as a one-liner, it was fun to do but it got ugly:
Gas Commercial Service \([\S\s]+?(?:[\s]+(?:(?:(?:[\w]+ )*)?(?:[\w]+)?Tax:[xX\d\.\%\s]*?\$[\d\.\s]*?\$([\d\.]*)\s*?))(?:[\s]+(?:(?:(?:[\w]+ )*)?(?:[\w]+)?Tax:[xX\d\.\%\s]*?\$[\d\.\s]*?\$([\d\.]*)\s*?))?(?:[\s]+(?:(?:(?:[\w]+ )*)?(?:[\w]+)?Tax:[xX\d\.\%\s]*?\$[\d\.\s]*?\$([\d\.]*)\s*?))?(?:[\s]+(?:(?:(?:[\w]+ )*)?(?:[\w]+)?Tax:[xX\d\.\%\s]*?\$[\d\.\s]*?\$([\d\.]*)\s*?))?(?:[\s]+(?:(?:(?:[\w]+ )*)?(?:[\w]+)?Tax:[xX\d\.\%\s]*?\$[\d\.\s]*?\$([\d\.]*)\s*?))?(?:[\s]+(?:(?:(?:[\w]+ )*)?(?:[\w]+)?Tax:[xX\d\.\%\s]*?\$[\d\.\s]*?\$([\d\.]*)\s*?))?
Explained demo here: http://regex101.com/r/fI7hU9
for Electric just change the first word
Updated to accept SurTax and alikes.
I am building a travel organiser application in ASP.NET / C#.
At the moment, the user types in their destination, and my application sends the latitude and longitude to the Google Places API, which returns a list of hotels in the destination city.
The application then plots markers on Google Map (v3) for the hotels, but strangely only for some (small) cities. If I try a major city, or even a large town, the map just won't appear at all.
If 20 results are returned for hotels in Reykjavik, the hotels will be shown without a problem. If 20 results are returned for Dublin, Paris, or Glasgow.. (I think you get the picture!), the map won't show.
I have noticed that hotels in these small cities seem to be in a fairly concentrated area, so I have tried zooming out for larger cities, but that still won't work.
Does anybody have any idea why this would be?
Many thanks.
I found the solution to this problem.
The issue was that I was not escaping apostrophe characters when I was reading in hotel and bar names from the Yelp API.
The reason that some cities were displaying and others were not is down to the general language used in that particular locale. Places like Dublin and Paris tend to have a higher number of instances of businesses with an apostrophe in their name (eg. "O'Haras, O'Reillys, L'Entre Potes, etc..), than say Reykjavik or Oslo, which was causing the map script to crash only in certain cities.
For those who didn't know, like me, you can escape apostrophes using a backslash.
alert('O\'Neils Bar, Dublin');
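Rather than escaping apostrophes by hand, one option is to let a JSON encoder produce the string literal, since a JSON string is also a valid JavaScript string literal. The sketch below uses Python's `json.dumps` purely to illustrate the idea; the helper name `js_string_literal` is made up, and in an ASP.NET application you would use whichever JSON serializer you already have.

```python
import json


def js_string_literal(text: str) -> str:
    """Return text as a double-quoted literal that is safe inside JavaScript."""
    return json.dumps(text)


name = "O'Neils Bar, Dublin"
print("alert(" + js_string_literal(name) + ");")
# alert("O'Neils Bar, Dublin");
```

The encoder also takes care of double quotes, backslashes and line breaks in names. If the output is written into an inline script block in an HTML page, text such as "</script>" inside a name still needs separate handling.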
In a logistics software which is written in C#, I need to check if given postal code is between one of the ranges at database.
For Germany, for example
Range: 47000-48000
Given Postal Code: 47057
Result: True
For numeric postal codes, it's all right. But what about UK postal codes? W11 2BQ is an example postal code from London.
One basic idea is to convert postal codes to numbers by converting each character into its ASCII code and writing the codes left to right.
so
W11 2BQ -> 87 49 49 32 50 66 81 -> 87,494,932,506,681
so one simple postal code becomes a very big number and that disturbs me. english postal codes can vary in sizes (up to 8 chars) so this makes the resulting number even bigger.
I use sql server to check if given postal code is in range.
Is there any official technique to deal with UK postal codes for range calculation?
Best,
Alper
'Postal codes' are a completely arbitrary system, instead of attempting to code against it - unless your objective is to write a comprehensive postal code comprehension library - I would strongly recommend finding/paying for/stealing a library that can do what you're asking for.
If it helps, the basic rules for UK postal codes are (as far as I remember from ones I've seen):
[A-Z]{1,2} - 2-character code representing the sorting depot in the locality
\d - subdivision of the sorting depot's jurisdiction
[mandatory space]
\d[A-Z]{2} - alphanumeric code for a contiguous region occupied by a group of 10-100 addresses
It wouldn't surprise me if my summarisation is wrong/incomplete. All the postal codes I've seen are in the form I mention, but I don't know the actual rules, so there could be others in a different format. It's included merely to give a broad appraisal of the nature of the system.
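If you do decide to code against the format, one alternative to a single big number is to split the postcode into its parts and compare them as a tuple. The Python sketch below assumes the common shape of one or two letters, a one- or two-digit district with an optional trailing letter, then a digit and two letters; it is not an official rule set, and the names `postcode_key` and `in_range` are made up.

```python
import re

# Assumed shape: area letters, district number, optional district letter,
# optional space, sector digit, two unit letters. Not an official grammar.
POSTCODE = re.compile(r"^([A-Z]{1,2})(\d{1,2})([A-Z]?)\s*(\d)([A-Z]{2})$")


def postcode_key(postcode: str) -> tuple:
    """Split a postcode into parts that sort in a sensible order."""
    match = POSTCODE.match(postcode.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised postcode: {postcode!r}")
    area, district, suffix, sector, unit = match.groups()
    return (area, int(district), suffix, int(sector), unit)


def in_range(postcode: str, low: str, high: str) -> bool:
    """True if postcode sorts between low and high (inclusive)."""
    return postcode_key(low) <= postcode_key(postcode) <= postcode_key(high)


print(in_range("W11 2BQ", "W10 0AA", "W12 9ZZ"))  # True
print(in_range("W2 1AA", "W10 0AA", "W12 9ZZ"))   # False
```

Comparing the district as a number is what keeps W2 outside the W10 to W12 range; a plain string comparison would sort "W2" after "W11". In SQL Server the same idea could be applied by storing the parts in separate columns.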
The GeoNames Web Service might be a good place to start. It should be possible to validate postal codes through their API somehow. I think you can export the database too, if you want to write up your own validation logic.
I don't think converting the zip code into ASCII is a good idea. The reason is quite obvious. If you have a ASCII converted value like 1210121 (just an example), the problem is to separate as (12) (10) (121) or (12) (101) (21). This seems to be a lot of work for such little gain.
Although, couldn't you use SQL??
select * from ZIPTable where zipcode IN ('G4543','G3543')
Note: You can get the zipcode values from a subquery.
I'm currently working on a project that requires me to match our database of Bands and venues with a number of external services.
Basically I'm looking for some direction on the best method for determining if two names are the same. For Example:
Our database venue name - "The Pig and Whistle"
service 1 - "Pig and Whistle"
service 2 - "The Pig & Whistle"
etc etc
I think the main differences are going to be things like missing "the" or using "&" instead of "and" but there could also be things like slightly different spelling and words in different orders.
What algorithms/techniques are commonly used in this situation, do I need to filter noise words or do some sort of spell check type match?
Have you seen any examples of something similar in C#?
UPDATE: In case anyone is interested in a C# example, there is a heap you can access by doing a Google code search for Levenshtein distance.
The canonical (and probably the easiest) way to do this is to measure the Levenshtein distance between the two strings. If the distance is small relative to the size of the string, it's probably the same string. Note that if you have to compare a lot of very small strings it'll be harder to tell whether they're the same or not. It works better with longer strings.
A smarter approach might be to compare the Levenshtein distance between the two strings but to assign a distance of zero to the more obvious transformations, like "and"/"&", "Snoop Doggy Dogg"/"Snoop", etc.
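One way to approximate the "distance of zero for obvious transformations" idea is to normalise both names first and only then measure the distance relative to the length. The Python sketch below does that; the `NOISE_WORDS` set and the 0.2 threshold are assumptions to be tuned, not recommended values.

```python
import re

NOISE_WORDS = {"the"}  # assumed list of words that carry no meaning


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def normalise(name: str) -> str:
    """Lower-case, spell out "&" and drop noise words and punctuation."""
    tokens = re.findall(r"[a-z0-9]+", name.lower().replace("&", " and "))
    return " ".join(t for t in tokens if t not in NOISE_WORDS)


def probably_same(a: str, b: str, max_ratio: float = 0.2) -> bool:
    """True if the normalised names differ by a small share of their length."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # both names were empty after normalising
    return levenshtein(a, b) / longest <= max_ratio


print(probably_same("The Pig and Whistle", "Pig & Whistle"))  # True
print(probably_same("Pig and Whistle", "Dog and Duck"))       # False
```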
I did something like this a while ago. I used the Discogs database (which is public domain), which also tracks artist aliases.
You can either:
Use an API call (namevariations field).
Download the monthly data dumps (*_artists.xml.gz) & import it in your database. This contains the same data, but is obviously a lot faster.
One advantage of this over the Levenshtein distance solution is that you'll get far fewer false matches.
For example, Ryan Adams and Bryan Adams have a score of 2, which is quite good (lower is a better match; Pig and Whistle and Pig & Whistle has a score of 3), yet they're obviously different people.
While you could make a smarter algorithm (which also looks at string length, for example), using the alias DB is a lot simpler & less error-prone; after implementing this, I could completely remove the solution that was suggested in the other answer & had better matches.
soundex may also be useful
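For reference, here is a minimal Python sketch of the classic American Soundex code, which maps names that sound alike to the same letter-plus-three-digits key. SQL Server has a built-in SOUNDEX function; this hand-written version is only meant to show the idea and may differ from a given database's implementation in edge cases.

```python
def soundex(name: str) -> str:
    """Return the four-character American Soundex code for name."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit

    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")  # vowels get "" and reset the previous code
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]  # pad or truncate to letter + three digits


print(soundex("Robert"), soundex("Rupert"))  # R163 R163
print(soundex("Smith"), soundex("Smyth"))    # S530 S530
```

Soundex works on single words, so for venue or band names you would probably apply it word by word.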
In bioinformatics we use this to compare DNA- or protein sequences all the time.
There are plenty of algorithms, you probably want to look at global alignments.
In this respect the Needleman-Wunsch algorithm is probably what you seek.
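As a rough illustration, the score of a Needleman-Wunsch global alignment can be computed with a small dynamic-programming table. The Python sketch below returns only the score, not the alignment itself, and the scoring values (match +1, mismatch -1, gap -1) are arbitrary assumptions.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b under a simple scoring scheme."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        current = [i * gap]  # a[:i] aligned against nothing
        for j, ch_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ch_a == ch_b else mismatch)
            current.append(max(diagonal,                # align ch_a with ch_b
                               previous[j] + gap,       # gap in b
                               current[j - 1] + gap))   # gap in a
        previous = current
    return previous[-1]


print(needleman_wunsch_score("pig and whistle", "pig & whistle"))  # 9
```

Higher scores mean more similar strings, so unlike an edit distance you would look for scores close to the length of the shorter name.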
If you have particularly long recurring strings to compare you might also want to consider heuristic searches like BLAST.