I'm looking for an algorithm that can compare two text messages (let's say forum posts) and report their similarity as a percentage.
What would be the most efficient solution for this purpose?
The idea is to use this algorithm to identify users on a forum who register more than one nickname and pretend to be different people.
I'm going to build a program that reads all of their posts and compares each post from the first account with the posts from the second account, to find out whether they are genuinely two different people or just two registrations of a single user.
The first thing that came to my mind was the Levenshtein distance, but it is more focused on word-level similarity.
You could use tf-idf, but it will probably work better if your corpus contains more than just two documents.
An alternative could be representing the documents (posts) using a vector space model, like:
(w_0, w_1, ..., w_k)
where
k is the total number of terms (words) in your document
w_i is the i-th term.
and then compute the Hamming distance, which basically compares two vectors (arrays) and counts the positions where they differ. You can discard stop words first (i.e. words like prepositions, etc.)
Take into account that the user might change some words, use synonyms, and so on. There are lots of models for representing documents and computing the similarity between them. Some of them take word dependencies into account, which gives more semantics to the process, and others don't.
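A minimal sketch of that vector-space idea in C# (the stop-word list and the way the Hamming distance is turned into a percentage are my own assumptions, not a fixed recipe):

using System;
using System.Collections.Generic;
using System.Linq;

static class PostSimilarity
{
    // A tiny illustrative stop-word list; extend it as needed.
    static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "a", "an", "of", "in", "on", "to", "and", "or" };

    static HashSet<string> Terms(string post)
    {
        var words = post.ToLowerInvariant()
                        .Split(new[] { ' ', '\t', '\n', '.', ',', ';', ':', '!', '?' },
                               StringSplitOptions.RemoveEmptyEntries)
                        .Where(w => !StopWords.Contains(w));
        return new HashSet<string>(words);
    }

    // Represent both posts as binary vectors over the shared vocabulary, count the
    // positions where they differ (the Hamming distance), and express the agreement
    // as a percentage.
    public static double SimilarityPercent(string postA, string postB)
    {
        var a = Terms(postA);
        var b = Terms(postB);
        var vocabulary = a.Union(b).ToList();
        if (vocabulary.Count == 0) return 100.0;

        int hamming = vocabulary.Count(term => a.Contains(term) != b.Contains(term));
        return 100.0 * (vocabulary.Count - hamming) / vocabulary.Count;
    }
}

Calling PostSimilarity.SimilarityPercent(post1, post2) for every pair of posts from the two accounts gives a rough score you can threshold on.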
google-diff-match-patch would be a good choice for you. You can look at the demo for testing.
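Roughly, usage with the C# port looks like this (a hedged sketch: it assumes the port's diff_main, diff_cleanupSemantic and diff_levenshtein members, and the percentage formula at the end is my own, not part of the library):

using System;
using DiffMatchPatch; // the C# port of google-diff-match-patch

class DiffDemo
{
    static void Main()
    {
        string post1 = "I think this moderator is unfair.";
        string post2 = "I believe this moderator is unfair!";

        var dmp = new diff_match_patch();
        var diffs = dmp.diff_main(post1, post2);
        dmp.diff_cleanupSemantic(diffs);            // merge trivial edits into readable chunks
        int distance = dmp.diff_levenshtein(diffs); // number of changed characters

        double similarity = 100.0 * (1.0 - (double)distance /
                                            Math.Max(post1.Length, post2.Length));
        Console.WriteLine("{0:F1}% similar", similarity);
    }
}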
I am tasked with matching free form text to data in a database.
What I mean by free-form is that it is a textbox and someone can type something/anything. For the most part, these entries are valid. I would like to find a list of values from a table that resemble what was typed in.
Before you ask, I have no control of said textbox, nor the people that type into it.
I am looking for techniques, not specific technologies.
Things I have tried:
Clearing out the common words from both the criteria and the list, e.g. (the, of, in, etc.)
The SOUNDEX function in SQL; it is very weak and not quite helpful.
The Levenshtein distance algorithm; I am pretty happy with the results, but it still needs a lot of polish.
For example I have this list:
The Hobbit: An Unexpected Journey
The Hobbit: The Desolation of Smaug
The Hobbit: There and Back Again
Iron Man 3
Despicable Me 2
Fast & Furious 6
Monsters University
The Hunger Games: Catching Fire
Man of Steel
Gravity
Thor: The Dark World
The Croods
World War Z
The users input could be:
hobit unexpected journ
The word 'hobit' is not spelled right
Expected result:
The Hobbit: An Unexpected Journey
The Hobbit: There and Back Again
The Hobbit: The Desolation of Smaug
hunger game
Expected result:
The Hunger Games: Catching Fire
What I guess I'm asking is: what other methods can I use to calculate these results? My stack is .NET 4.0 and MSSQL 2008 R2.
I would try an algorithm like the following:
remove the common words from both the criteria and the list (the, of, in, etc.)
for each criteria word check if it's included in an entry of the list
if it's included, assign some score/value for this criteria word
if it's not included, check the Levenshtein Distance between the criteria word and each of the words in the entry of the list you are checking against
then assign a score/value for the lowest Levenshtein Distance you have found (it's probably better to ignore any Levenshtein Distance higher than 3 or 4)
when you have checked all the criteria words against the current entry of the list, check how many words of the current entry are not included in the criteria, and assign a negative score/value for each of these words
sum up all the scores/values: now you have a single score/value for the criteria against a single entry of your list
Repeat this for any entry in your list.
If the data you are actually analysing are film titles:
you could add some modifiers, such as a multiplying factor on the value/score for the most recent films.
you can speed things up by keeping two lists to check against: one with the most searched/recent films, and a second list with all the other titles (and if you get enough hits from checking the first list, you can skip the check against the second list)
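Putting the steps above together, a sketch in C# might look like this (the actual score values, the distance cut-off of 3, and the word splitting are arbitrary choices you would want to tune):

using System;
using System.Collections.Generic;
using System.Linq;

static class TitleMatcher
{
    static readonly HashSet<string> CommonWords =
        new HashSet<string> { "the", "of", "in", "a", "an", "and" };

    static string[] Words(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ', ':', '&', ',', '.' },
                          StringSplitOptions.RemoveEmptyEntries)
                   .Where(w => !CommonWords.Contains(w))
                   .ToArray();
    }

    // Standard dynamic-programming Levenshtein distance.
    static int Levenshtein(string s, string t)
    {
        var d = new int[s.Length + 1, t.Length + 1];
        for (int i = 0; i <= s.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= t.Length; j++) d[0, j] = j;
        for (int i = 1; i <= s.Length; i++)
            for (int j = 1; j <= t.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1));
        return d[s.Length, t.Length];
    }

    public static double Score(string criteria, string entry)
    {
        var criteriaWords = Words(criteria);
        var entryWords = Words(entry);
        if (entryWords.Length == 0) return 0;
        double score = 0;

        foreach (var cw in criteriaWords)
        {
            if (entryWords.Any(ew => ew.Contains(cw)))
            {
                score += 3;                               // the criteria word appears in the entry
            }
            else
            {
                int best = entryWords.Min(ew => Levenshtein(cw, ew));
                if (best <= 3) score += 3.0 / (best + 1); // near miss, weighted by distance
            }
        }

        // Negative score for entry words the criteria never mentioned.
        score -= 0.5 * entryWords.Count(ew =>
            !criteriaWords.Any(cw => ew.Contains(cw) || Levenshtein(cw, ew) <= 3));

        return score;
    }
}

Ranking is then just titles.OrderByDescending(t => TitleMatcher.Score("hobit unexpected journ", t)), which puts the three Hobbit titles at the top of your sample list.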
I have recently started using the Google Places API and am a big noob with it. I have looked around the main docs on how to run a query against the API, but it seems that it does not support what I want, or I'm looking in the wrong place.
I need to search in a specific place for a specific term, for example:
Restaurants and USA
Is this possible, or how would I have to go about producing it using the API?
When you do a Places Search: https://developers.google.com/maps/documentation/places/#PlaceSearches
You can specify a types parameter which limits the types of things you are searching for.
Or you can specify a keyword parameter which selects for a certain term across the whole Place record.
For location, your only option is to select a latitude/longitude pair and specify a radius. This won't work for "USA" as the maximum radius is 50000 meters. You could add that as a keyword however. For locations such as cities, you could geocode first to get the lat/long pair:
https://developers.google.com/maps/documentation/geocoding/
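For example, a nearby search for restaurants around a geocoded point could be requested like this (a sketch against the Place Search endpoint documented above; the coordinates and the YOUR_API_KEY placeholder are illustrative):

using System;
using System.Net;

class PlacesDemo
{
    static void Main()
    {
        // A lat/long pair obtained beforehand from the Geocoding API.
        string location = "40.7128,-74.0060";

        string url = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
                   + "?location=" + location
                   + "&radius=50000"          // maximum allowed radius, in metres
                   + "&types=restaurant"      // or use &keyword=... for a free-text term
                   + "&key=YOUR_API_KEY";     // placeholder

        using (var client = new WebClient())
        {
            string json = client.DownloadString(url);
            Console.WriteLine(json);
        }
    }
}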
I'm building an application that does sentence checking. Do you know of any DLLs out there that recognize sentences and their logic and can organize sentences correctly? That is, put words into the correct order to form a proper sentence.
If it's not available, maybe you can suggest search terms that I can research.
There are things called a language model and n-grams. I'll try to explain briefly what they are.
Suppose you have a huge collection of correct English sentences. Let's pick one of them:
The quick brown fox jumps over the lazy dog. Let's now look at all the pairs of adjacent words (called bigrams) in it:
(the, quick), (quick, brown), (brown, fox), (fox, jumps) and so on...
Having a huge collection of sentences, we will have a huge number of bigrams. We now take the unique ones and count their frequencies (the number of times we saw each one in correct sentences).
We now have, say:
('the', 'quick') - 500
('quick', 'brown') - 53
Bigrams together with their frequencies are called a language model. It shows you how common a certain combination of words is.
So you can build all the possible sentences from your words and compute a weight for each of them, taking the language model into account. The sentence with the maximum weight is going to be what you need.
Where do you get the bigrams and their frequencies? Well, Google has them.
You can use not just pairs of words, but triples and so on. It will allow you to build more human-like sentences.
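A rough sketch of that idea in C# (the tiny hard-coded corpus stands in for the huge collection of correct sentences, and brute-forcing every permutation only works for short inputs):

using System;
using System.Collections.Generic;
using System.Linq;

static class BigramDemo
{
    // Count how often each pair of adjacent words occurs in the training sentences.
    static Dictionary<Tuple<string, string>, int> BuildModel(IEnumerable<string> sentences)
    {
        var counts = new Dictionary<Tuple<string, string>, int>();
        foreach (var sentence in sentences)
        {
            var words = sentence.ToLowerInvariant().Split(' ');
            for (int i = 0; i + 1 < words.Length; i++)
            {
                var bigram = Tuple.Create(words[i], words[i + 1]);
                int c;
                counts.TryGetValue(bigram, out c);
                counts[bigram] = c + 1;
            }
        }
        return counts;
    }

    // The weight of a candidate ordering is the sum of its bigram frequencies.
    static int Weight(string[] words, Dictionary<Tuple<string, string>, int> model)
    {
        int total = 0;
        for (int i = 0; i + 1 < words.Length; i++)
        {
            int c;
            if (model.TryGetValue(Tuple.Create(words[i], words[i + 1]), out c)) total += c;
        }
        return total;
    }

    // All orderings of the input words (factorial growth, so short inputs only).
    static IEnumerable<string[]> Permutations(string[] items)
    {
        if (items.Length <= 1) { yield return items; yield break; }
        for (int i = 0; i < items.Length; i++)
        {
            var rest = items.Take(i).Concat(items.Skip(i + 1)).ToArray();
            foreach (var p in Permutations(rest))
                yield return new[] { items[i] }.Concat(p).ToArray();
        }
    }

    static void Main()
    {
        var corpus = new[]
        {
            "the quick brown fox jumps over the lazy dog",
            "the lazy dog sleeps"
        };
        var model = BuildModel(corpus);

        var jumbled = new[] { "fox", "quick", "the", "brown" };
        var best = Permutations(jumbled).OrderByDescending(p => Weight(p, model)).First();
        Console.WriteLine(string.Join(" ", best)); // prints: the quick brown fox
    }
}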
There are a few NLP (Natural Language Processing) libraries available, like SharpNLP, and some in Java.
A few links:
http://nlpdotnet.com
http://blog.abodit.com/2010/02/a-strongly-typed-natural-language-engine-c-nlp/
http://sharpnlp.codeplex.com/
This is a very complex subject you are asking about. It's called computational linguistics, or natural language processing, and it is the subject of ongoing research.
Here are a few links to get you started:
http://en.wikipedia.org/wiki/Natural_language_processing
http://en.wikipedia.org/wiki/Computational_linguistics
http://research.microsoft.com/en-us/groups/nlp/
I guess you won't be able to just download a DLL and let it flow :)
I have a list of addresses in two separate tables that are slightly off that I need to be able to match. For example, the same address can be entered in multiple ways:
110 Test St
110 Test St.
110 Test Street
Although simple, you can imagine the situation in more complex scenarios. I am trying to develop a simple algorithm that will be able to match the above addresses as a key.
For example, the key might be "11TEST": the first two characters of 110, the first two of Test, and the first two of the street variant. A full match key would also include the first 5 digits of the zip code, so in the above example the full key might look like "11TEST44680".
I am looking for ideas for an effective algorithm or resources I can look at for considerations when developing this. Any ideas can be pseudo code or in your language of choice.
We are only concerned with US addresses. In fact, we are only looking at addresses from 250 zip codes in Ohio and Michigan. We also do not have access to any postal software, although we would be open to ideas for cost-effective solutions (it would essentially be a one-time use). Please be mindful that this is an initial dump of data from a government source, so suggestions on how users can clean it are helpful as I build out the application, but I would love to have the best initial match I possibly can by matching addresses as well as possible.
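To make the idea concrete, here is the kind of naive key builder I have in mind (in C#, just as a sketch; it assumes a leading house number and a trailing street-type word, which is obviously fragile):

using System;
using System.Linq;
using System.Text.RegularExpressions;

static class AddressKey
{
    static string Prefix(string s, int n)
    {
        return s.Length <= n ? s : s.Substring(0, n);
    }

    // Crude match key: first 2 chars of the house number, the street name and
    // the street-type word, plus the first 5 of the zip code.
    // ("110 Test Street", "44680") -> "11TEST44680"
    public static string Create(string addressLine, string zip)
    {
        var words = Regex.Replace(addressLine.ToUpperInvariant(), @"[^A-Z0-9 ]", " ")
                         .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length < 2) return null;

        string number = words[0];                   // assumes a leading house number
        string type   = words[words.Length - 1];    // assumes a trailing ST/STREET/etc.
        string name   = string.Join("", words.Skip(1).Take(words.Length - 2).ToArray());

        return Prefix(number, 2) + Prefix(name, 2) + Prefix(type, 2) + Prefix(zip, 5);
    }
}

All three spellings in my example ("110 Test St", "110 Test St.", "110 Test Street") produce the same "11TEST44680" key.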
I'm working on a similar algorithm as we speak; it should handle addresses in Canada, the USA, Mexico and the UK by the time I'm done. The problem I'm facing is that they're stored in our database in a 3-field plain-text format [whoever thought that was a good idea should be shot, IMHO], so trying to handle rural routes, general deliveries, large-volume receivers, multiple countries, province vs. state vs. county, postal codes vs. zip codes, and spelling mistakes is no small or simple task.
Spelling mistakes alone were no small feat - especially when you get to countries that use French names: matching Saint, Sainte, St, Ste, Saints, Saintes, Sts, Stes, Grand, Grande, Grands, Grandes, with or without periods or hyphenation, to the larger part of a name causes no end of performance issues - especially when St could mean saint or street and may or may not have been entered in the correct context (i.e. feminine vs. masculine). What if the address has largely been entered correctly but has an incorrect province or postal code?
One place to start your search is the Levenshtein distance algorithm, which I've found to be really useful for eliminating a large portion of spelling mistakes. After that, it's mostly a case of searching for keywords and comparing against a postal database.
I would be really interested in collaborating with anyone that is currently developing tools to do this, perhaps we can assist each other to a common solution. I'm already part of the way there and have overcome all the issues I've mentioned so far, having someone else working on the same problem would be really helpful to bounce ideas off.
Cheers -
[ben at afsinc dot ca]
If you would prefer not to develop one and would rather use an off-the-shelf product that uses many of the technologies mentioned here, see: http://www.melissadata.com/dqt/matchup-api.htm
Disclaimer: I had a role in its development and work for the company.
In the UK we would use:
House Name or Number (where name includes Flat number for apartment blocks)
Postcode
You should certainly be using the postcode, but in the US I believe your Zip codes cover very wide areas compared to postcodes in the UK. You would therefore need to use the street and city.
Your example wouldn't differentiate between 11 Test Street, 110 - 119 Test Street, etc.
If your company has access to an address lookup system, I would run all the data through that to get the data back in a consistent format, possibly with address keys that can be used for matching.
If I were to take a crack at this, I'd convert each address string into a tree using a pre-defined order of operations.
Eg. 110 Test Street Apt 3. Anywhere California 90210 =>
Get the type of address. E.g. street addresses have different formats than rural route addresses, and this differs by country.
Given that this is a street address, get the string that represents the type of street and convert that to an enum (eBoulevard, eRoad, etc..)
Given that this is a street address, pull out the street name (store in lower case)
Given that this is a street address, pull out the street number
Given that this is a street address, look for any apartment number (could be before the street number with a dash, could be after "Apt.", etc...)
eStreet //1.an enum of possible address types eg. eStreet, eRuralRoute,...
|
eStreet //2.an enum of street types eg. eStreet, eBlvd, eWay,...
/ | \
Name Number Apt
| | |
test 110 3
Eg. RR#3 Anywhere California 90210 =>
Get the type of address: rural route
Given that this is a rural route address, get the route number
eRuralRoute
|
3
You'll need to do something similar for country state and zip information.
Then compare the resulting trees.
This makes the comparison very simple; however, the code to generate the trees is very tricky. You'd want to test the crap out of it on thousands and thousands of addresses. Your problem is simpler if it is only US addresses you care about; British addresses, as already mentioned, are quite different, and Canadian addresses may have French in them (e.g. Place D'Arms, Rue Laurent, etc...)
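A much-reduced sketch of that parsing step in C#, covering only the two shapes in the examples above (the enums, regular expressions and matching rules are all my own simplifications, not a complete parser):

using System;
using System.Text.RegularExpressions;

enum AddressKind { Street, RuralRoute }
enum StreetType { Street, Boulevard, Road, Avenue, Way, Unknown }

class ParsedAddress
{
    public AddressKind Kind;
    public StreetType StreetType;
    public string StreetName;      // stored lower-case
    public string StreetNumber;
    public string Apartment;
    public string RouteNumber;

    // Very simplified parser for shapes like "110 Test Street Apt 3" and "RR#3".
    public static ParsedAddress Parse(string line)
    {
        line = line.Trim().ToLowerInvariant();

        var rr = Regex.Match(line, @"^rr\s*#?\s*(\d+)");
        if (rr.Success)
            return new ParsedAddress { Kind = AddressKind.RuralRoute, RouteNumber = rr.Groups[1].Value };

        var st = Regex.Match(line,
            @"^(\d+)\s+(.+?)\s+(street|st|blvd|boulevard|road|rd|ave|avenue|way)\b(?:\.?\s*(?:apt|#)\s*(\d+))?");
        if (!st.Success) return null;

        return new ParsedAddress
        {
            Kind = AddressKind.Street,
            StreetNumber = st.Groups[1].Value,
            StreetName = st.Groups[2].Value,
            StreetType = ToStreetType(st.Groups[3].Value),
            Apartment = st.Groups[4].Success ? st.Groups[4].Value : null
        };
    }

    static StreetType ToStreetType(string word)
    {
        switch (word)
        {
            case "st": case "street": return StreetType.Street;
            case "blvd": case "boulevard": return StreetType.Boulevard;
            case "rd": case "road": return StreetType.Road;
            case "ave": case "avenue": return StreetType.Avenue;
            case "way": return StreetType.Way;
            default: return StreetType.Unknown;
        }
    }

    // Two addresses match when their normalised parts match.
    public bool Matches(ParsedAddress other)
    {
        if (other == null || Kind != other.Kind) return false;
        if (Kind == AddressKind.RuralRoute) return RouteNumber == other.RouteNumber;
        return StreetType == other.StreetType
            && StreetName == other.StreetName
            && StreetNumber == other.StreetNumber
            && Apartment == other.Apartment;
    }
}

With this, ParsedAddress.Parse("110 Test Street Apt 3").Matches(ParsedAddress.Parse("110 test st apt 3")) returns true.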
If it is cost-effective for your company to write its own address normalization tool then I'd suggest starting with the USPS address standard. Alternatively, there are any number of vendors offering server side tools and web services to normalize, correct and verify addresses.
My company uses AccuMail Gold for this purpose because it does a lot more than just standardize & correct the address. When we considered the cost of even one week's worth of salary to develop a tool in-house the choice to buy an off-the-shelf product was obvious.
If you don't choose to use an existing system, one idea is to do the following:
Extract numbers from the address line
replace common street words with blanks
create match string
ie: "555 Canal Street":
Extract number gives "555" + "Canal Street"
Replace street words gives "555" + "Canal"
Create match string gives "555Canal"
"Canal st 555" would give the same match string.
By street words I mean words and abbreviations for "street" in your language; for example "st", "st.", "blv", "ave", "avenue", etc. are all removed from the string.
By extracting the numbers and separating them from the string, it does not matter whether they come first or last.
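In C# that could look something like this (the street-word list is only a starting point, and since I uppercase everything the key comes out as "555CANAL" rather than "555Canal"):

using System;
using System.Linq;
using System.Text.RegularExpressions;

static class MatchString
{
    // Words and abbreviations for "street"; extend for your language/data.
    static readonly string[] StreetWords =
        { "street", "st", "blv", "blvd", "ave", "avenue", "road", "rd", "drive", "dr" };

    // "555 Canal Street" and "Canal st 555" both become "555CANAL".
    public static string Create(string addressLine)
    {
        var tokens = Regex.Replace(addressLine.ToUpperInvariant(), @"[^A-Z0-9 ]", " ")
                          .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        string numbers = string.Join("", tokens.Where(t => t.All(char.IsDigit)).ToArray());
        string words   = string.Join("", tokens.Where(t => !t.All(char.IsDigit) &&
                                                           !StreetWords.Contains(t.ToLowerInvariant()))
                                               .ToArray());
        return numbers + words;
    }
}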
use an identity for the primary key, this will always be unique and will make it easier to merge duplicates later.
force proper data entry with the user interface. Make them enter each component in its own text box: the house number in its own box, the street name in its own box, the city in its own box, the state from a select list, etc. This will make looking for matches easier.
have a two-step "save" process
after the initial save, do a search to look up matches and present the user with a list of possible matches as well as the new entry.
if they confirm the new entry, save it; if they pick an existing one, use that ID
clean the data. Try to strip out "street", "st", "drive", etc. and store it as a StreetType char(1) that uses an FK to a table containing the proper abbreviations, so you can rebuild the street.
look into SOUNDEX and DIFFERENCE
I have worked at large companies that maintain mailing lists, and they did not attempt to do it automatically; they used people to filter out the new entries from the duplicates because it is so hard to do. Plan for a merge feature so you can manually merge duplicates when they occur, and ripple the values through the PKs.
You might look into the Google Maps API and see if you can pass in your address and get a match back. I'm not familiar with it; this is just speculation.