I have a list of ~20,000 email addresses, some of which I know to be fraudulent attempts to get around a "1 per e-mail" limit, such as username1@gmail.com, username1a@gmail.com, username1b@gmail.com, etc. I want to find similar email addresses for evaluation. Currently I'm using a Levenshtein algorithm to check each e-mail against the others in the list and report any with an edit distance of less than 2. However, this is painfully slow. Is there a more efficient approach?
The test code I'm using now is:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Threading;
namespace LevenshteinAnalyzer
{
class Program
{
const string INPUT_FILE = @"C:\Input.txt";
const string OUTPUT_FILE = @"C:\Output.txt";
static void Main(string[] args)
{
var inputWords = File.ReadAllLines(INPUT_FILE);
var outputWords = new SortedSet<string>();
for (var i = 0; i < inputWords.Length; i++)
{
if (i % 100 == 0)
Console.WriteLine("Processing record #" + i);
var word1 = inputWords[i].ToLower();
for (var n = i + 1; n < inputWords.Length; n++)
{
if (i == n) continue;
var word2 = inputWords[n].ToLower();
if (word1 == word2) continue;
if (outputWords.Contains(word1)) continue;
if (outputWords.Contains(word2)) continue;
var distance = LevenshteinAlgorithm.Compute(word1, word2);
if (distance <= 2)
{
outputWords.Add(word1);
outputWords.Add(word2);
}
}
}
File.WriteAllLines(OUTPUT_FILE, outputWords.ToArray());
Console.WriteLine("Found {0} words", outputWords.Count);
}
}
}
Edit: Some of the stuff I'm trying to catch looks like:
01234567890@gmail.com
0123456789@gmail.com
012345678@gmail.com
01234567@gmail.com
0123456@gmail.com
012345@gmail.com
01234@gmail.com
0123@gmail.com
012@gmail.com
You could start by applying some prioritization to which emails to compare to one another.
A key reason for the performance limitation is the O(n²) cost of comparing each address to every other address. Prioritization is the key to improving the performance of this kind of search algorithm.
For instance, you could bucket all emails that have a similar length (+/- some amount) and compare that subset first. You could also strip all special characters (numbers, symbols) from emails and find those that are identical after that reduction.
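For illustration, here is a minimal sketch of that second idea (the file path and the Reduce helper are placeholders, not part of the original code):
var emails = File.ReadAllLines(@"C:\Input.txt");
// Reduce each address to letters only, then group: addresses that collapse
// to the same key become a small candidate set for the expensive pairwise check.
string Reduce(string email) =>
    new string(email.ToLowerInvariant().Where(char.IsLetter).ToArray());
foreach (var group in emails.GroupBy(Reduce).Where(g => g.Count() > 1))
    Console.WriteLine("Candidate group: " + string.Join(", ", group));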
You may also want to create a trie from the data rather than processing it line by line, and use that to find all emails that share a common set of suffixes/prefixes and drive your comparison logic from that reduction. From the examples you provided, it looks like you are looking for addresses where a part of one address could appear as a substring within another. Tries (and suffix trees) are efficient data structures for performing these types of searches.
Another possible way to optimize this algorithm would be to use the date when the email account is created (assuming you know it). If duplicate emails are created they would likely be created within a short period of time of one another - this may help you reduce the number of comparisons to perform when looking for duplicates.
Well you can make some optimizations, assuming that the Levenshtein difference is your bottleneck.
1) With a Levenshtein distance of 2, the emails are going to be within 2 characters length of one another, so don't bother to do the distance calculations unless abs(length(email1)-length(email2)) <= 2
2) Again, with a distance of 2, there are not going to be many characters different, so you can make HashSets of the characters in the emails and take the size of their symmetric difference, i.e. the size of the union minus the size of the intersection (HashSet<T>.SymmetricExceptWith computes this in place). Since a single edit can change the character sets by at most two characters, if the result is greater than 4 you can safely skip to the next comparison (see the sketch below).
OR
Code your own Levenshtein distance algorithm. If you are only interested in distances < k, you can optimize the run time. See "Possible Improvements" on the Wikipedia page: http://en.wikipedia.org/wiki/Levenshtein_distance.
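For reference, a minimal sketch of the two pre-filters from points 1) and 2), reusing the LevenshteinAlgorithm.Compute call from the question (note the conservative character-set cutoff of 4 rather than 2, since one edit can change the sets by up to two characters):
static bool CouldBeWithinDistance2(string a, string b)
{
    // Filter 1: lengths more than 2 apart can never be within edit distance 2.
    if (Math.Abs(a.Length - b.Length) > 2) return false;
    // Filter 2: size of the symmetric difference of the character sets.
    var chars = new HashSet<char>(a);
    chars.SymmetricExceptWith(b);
    return chars.Count <= 4;
}
// Usage: only run the expensive computation when both filters pass.
// if (CouldBeWithinDistance2(word1, word2) &&
//     LevenshteinAlgorithm.Compute(word1, word2) <= 2) { ... }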
You could add a few optimizations:
1) Keep a list of known frauds and compare to that first. After you get going in your algorithm, you might be able to hit against this list faster than you hit the main list.
2) Sort the list first. It won't take too long (in comparison) and will increase the chance of matching the front of the string first. Have it sort by domain name first, then by username. Perhaps put each domain in its own bucket, then sort and also compare against that domain.
3) Consider stripping the domain in general; otherwise spammer3@gmail.com and spammer3@hotmail.com will never trigger your flag.
If you can define a suitable mapping to some k-dimensional space, and a suitable norm on that space, this reduces to the All Nearest Neighbours Problem which can be solved in O(n log n) time.
Finding such a mapping, however, might be difficult. Maybe someone will take this partial answer and run with it.
Just for completeness, you should consider the semantics of email addresses as well, in terms of:
Gmail treats user.name and username as being the same, so both are valid email addresses belonging to the same user. Other services may do this as well. LBushkin's suggestion to strip special characters would help here.
Sub-addressing can potentially trip up your filter if users wise up to it. You'd want to drop the sub-address data before comparison.
You might want to look at the full data set to see if there is other commonality between accounts that have spoofed emails.
I don't know what your application does, but if there are other key points, then use those to filter down which addresses you are going to compare.
Sort everything into a hashtable first. The key should be the domain name of the email, e.g. "gmail.com". Strip out special characters from the values, as was mentioned above.
Then check all the gmail.com's against one another. That should be much faster. Do not compare things that are more than 3 characters different in length.
As a second step, check all the keys against one another, and develop groupings there. (gmail.com == googlemail.com, for example.)
I agree with the other comments about comparing email addresses not being too helpful, since users could just as well create fraudulent addresses that look nothing alike.
I think it is better to come up with other solutions, such as limiting the number of emails that can be submitted per hour/day, or the delay between an address being received by you and the corresponding email being sent out to the user. Basically, work it out in a way where it is comfortable to send a few invites per day, but a PITA to send out many. I guess most users would forget or give up if they had to spread it out over a relatively long period of time in order to get their freebies.
Is there any way you can do a check on the IP address of the person creating the email? That would be a simple way to determine, or at least give you added information about, whether the different email addresses have come from the same person.
Related
I have a variable (serial_and_username_and_subType) that contains this type of text:
CT-AF-23-GQG %username1% *subscriptionType*
DHR-345349-E %username2% *subscriptionType*
C3T-AF434-234-GQG %username3% *subscriptionType*
34-7-HHDHFD-DHR-345349-E %username4% *subscriptionType*
example: ST-NN1-CQ-QQQ-G12 %RandomDUDE12% *Lifetime*
After that, I have an if statement that checks whether the user input is present in serial_and_username_and_subType:
if (userInput.Contains(serial_and_username_and_subType))......
Then, what I would like to do (but am having trouble with) is that when someone enters a serial, the program prints the corresponding username and subscription:
for example:
Please enter your Serial:
> ST-NN1-CQ-QQQ-G12
Welcome, RandomDUDE12!
You currently have a Lifetime subscription!
Does anyone know a method or a way to obtain what I need?
You are already using Contains(). The other things you could use are
Substring()
Split()
IndexOf()
Split() is probably the easiest one as long as you can guarantee that neither % nor * are part of the serial, username or license:
var s = "ST-NN1-CQ-QQQ-G12 %RandomDUDE12% *Lifetime*";
var splitPercent = s.Split('%');
Console.WriteLine(splitPercent[1]);
var splitStar = s.Split('*');
Console.WriteLine(splitStar[1]);
This approach will work fine as long as you have few licenses only (maybe a few thousand are ok, because PCs are fast). If you have many licenses (like millions), you probably want to separate all that information so that it is not in a string, but a data structure. You would then use a dictionary and access the information directly, instead of iterating through all of them.
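If it helps, here is a rough sketch of that dictionary idea, assuming each line is laid out as serial, then %username%, then *subscription* (the parsing details are only illustrative):
// Parse each "SERIAL %user% *sub*" line once into a dictionary keyed by serial.
var licenses = new Dictionary<string, (string User, string Sub)>();
foreach (var line in serial_and_username_and_subType.Split(
             new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
{
    var serial = line.Substring(0, line.IndexOf(" %")).Trim();
    licenses[serial] = (line.Split('%')[1], line.Split('*')[1]);
}
Console.Write("Please enter your Serial: ");
var input = Console.ReadLine()?.Trim();
if (input != null && licenses.TryGetValue(input, out var info))
{
    Console.WriteLine($"Welcome, {info.User}!");
    Console.WriteLine($"You currently have a {info.Sub} subscription!");
}
Building the dictionary is done once; every lookup afterwards is a single hash probe instead of a scan over the whole string.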
I'm building a custom textbox to enable mentioning people in a social media context. This means that I detect when somebody types "@" and search a list of contacts for the string that follows the "@" sign.
The easiest way would be to use LINQ, with something along the lines of Members.Where(x => x.Username.StartsWith(str)). The problem is that the amount of potential results can be extremely high (up to around 50,000), and performance is extremely important in this context.
What alternative solutions do I have? Is there anything similar to a dictionary (a hashtable-based solution) but that would allow me to use Key.StartsWith without iterating over every single entry? If not, what would be the fastest and most efficient way to achieve this?
Do you have to show a dropdown of 50000? If you can limit your dropdown, you can for example just display the first 10.
var filteredMembers = new List<MemberClass>();
foreach(var member in Members)
{
    if(member.Username.StartsWith(str)) filteredMembers.Add(member);
    if(filteredMembers.Count >= 10) break;
}
Alternatively:
You can try storing all your members' usernames in a Trie in addition to your collection. That should give you better performance than looping through all 50,000 elements.
Assuming your usernames are unique, you can store your member information in a dictionary and use the usernames as the key.
This is a tradeoff of memory for performance of course.
It is not really clear where the data is stored in the first place. Are all the names in memory or in a database?
In case you store them in a database, you can just use the StartsWith approach in the ORM, which would translate to a LIKE query on the DB, which would just do its job. If you enable full-text indexing on the column, you could improve the performance even more.
Now suppose all the names are already in memory. Remember the computer CPU is extremely fast, so even looping through 50,000 entries takes just a few moments.
The StartsWith method is optimized and will return false as soon as it encounters a non-matching character. Finding the ones that actually match should be pretty fast. But you can still do better.
As others suggest, you could build a trie to store all the names and be able to search for matches pretty fast, but there is a disadvantage: building the trie requires you to read all the names and create the whole data structure, which is complex. Also, you would be restricted to a given set of characters, and an unexpected character would have to be dealt with separately.
You can however group the names into "buckets". First start with the first character and create a dictionary with the character as the key and a list of names as the value. Now you have effectively narrowed every following search by a factor of approximately 26 (assuming the English alphabet). But you don't have to stop there: you can perform this on another level, for the second character in each group, and then the third, and so on (a sketch of the first level is shown below).
With each level you are effectively narrowing each group significantly and the search will be much faster afterwards. But there is of course the up-front cost of building the data structure, so you always have to find the right trade-off for you. More work up-front = faster search, less work = slower search.
Finally, when the user types, with each new letter she narrows the target group. Hence, you can always maintain the set of relevant names for the current input and cut it down with each successive keystroke. This will prevent you from having to go from the beginning each time and will improve the efficiency significantly.
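A minimal sketch of that first-level bucketing, assuming the Members collection and Username property from the question (and MemberClass from the snippet above as the element type):
// Build once: first character -> members whose username starts with it.
var buckets = Members
    .Where(m => !string.IsNullOrEmpty(m.Username))
    .GroupBy(m => char.ToLowerInvariant(m.Username[0]))
    .ToDictionary(g => g.Key, g => g.ToList());
List<MemberClass> Lookup(string prefix, int maxHits = 10)
{
    if (string.IsNullOrEmpty(prefix)) return new List<MemberClass>();
    if (!buckets.TryGetValue(char.ToLowerInvariant(prefix[0]), out var candidates))
        return new List<MemberClass>();
    // Only the relevant bucket is scanned, not all 50,000 members.
    return candidates
        .Where(m => m.Username.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
        .Take(maxHits)
        .ToList();
}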
Use BinarySearch
This is a pretty normal case, assuming that the data are stored in-memory, and here is a pretty standard way to handle it.
Use a normal List<string>. You don't need a HashTable or a SortedList. However, an IEnumerable<string> won't work; it has to be a list.
Sort the list beforehand (using LINQ, e.g. OrderBy( s => s)), e.g. during initialization or when retrieving it. This is the key to the whole approach.
Find the index of the best match using BinarySearch. Because the list is sorted, a binary search can find the best match very quickly and without scanning the whole list like Select/Where might.
Take the first N entries after the found index. Optionally you can truncate the list if not all N entries are a decent match, e.g. if someone typed "AZ" and there are only one or two items before "BA."
Example:
public static IEnumerable<string> Find(List<string> list, string firstFewLetters, int maxHits)
{
var startIndex = list.BinarySearch(firstFewLetters);
//If negative, no match. Take the bitwise complement (~) to get the index of the closest match.
if (startIndex < 0)
{
startIndex = ~startIndex;
}
//Take maxHits items, or go till end of list
var endIndex = Math.Min(
startIndex + maxHits - 1,
list.Count-1
);
//Enumerate matching items
for ( int i = startIndex; i <= endIndex; i++ )
{
var s = list[i];
if (!s.StartsWith(firstFewLetters)) break; //This line is optional
yield return s;
}
}
I have a bunch of txt files that contains 300k lines. Each line has a URL. E.g. http://www.ieee.org/conferences_events/conferences/conferencedetails/index.html?Conf_ID=30718
In some string[] array I have a list of web-sites
amazon.com
google.com
ieee.org
...
I need to check whether each URL contains one of these web-sites and update the counter that corresponds to that web-site.
For now I'm using the Contains method, but it is very slow. There are ~900 records in the array, so the worst case is 900 * 300K (for 1 file). I believe that IndexOf will be slow as well.
Can someone help me with a faster approach? Thank you in advance.
A good solution would leverage hashing. My approach would be the following (a sketch appears after the steps):
Hash all your known hosts (the string[] collection that you mention)
Store the hashes in a List<int> (hashes.Add("www.ieee.com".GetHashCode()))
Sort the list (hashes.Sort())
When looking up a url:
Parse out host name from the url (get ieee.com from http://www.ieee.com/...). You can use new Uri("http://www.ieee.com/...").Host to get www.ieee.com.
Preprocess it to always expect same case. Use lower case (if you have http://www.IEee.COM/ take www.ieee.com)
Hash parsed host name, and look for it in the hashes list. Use BinarySearch method to find the hash.
If the hash exists, then you have this host in your list
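A literal sketch of those steps might look like the following; sites stands in for the string[] of known web-sites from the question, and since GetHashCode can collide, a positive hit may still deserve a confirming string comparison:
// Build once; the stored hosts and the parsed hosts must use the same form
// (e.g. both with or both without a leading "www.").
var hashes = sites.Select(s => s.ToLowerInvariant().GetHashCode()).ToList();
hashes.Sort();
bool MightBeKnown(string url)
{
    var host = new Uri(url).Host.ToLowerInvariant();      // e.g. "www.ieee.org"
    return hashes.BinarySearch(host.GetHashCode()) >= 0;  // non-negative index = found
}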
An even faster and more memory-efficient way is to use Bloom filters. I suggest you read about them on Wikipedia, and there's even a C# implementation of a Bloom filter on CodePlex. Of course, you need to take into account that a Bloom filter allows false positive results (it can tell you that a value is in a collection even though it's not), so it's used for optimization only. It will never tell you that something is not in the collection when it really is.
Using a Dictionary<TKey, TValue> is also an option, but if you only need to count number of occurrences, it's more efficient to maintain collection of hashes yourself.
Create a Dictionary of domain to counter.
For each URL, extract the domain (I'll leave that part to you to figure out), then look up the domain in the Dictionary and increment the counter.
I assume we're talking about domains since this is what you showed in your array as examples. If this can be any part of the URL instead, storing all your strings in a trie-like structure could work.
You can read this question; the answers will help you:
High performance "contains" search in list of strings in C#
Well, for a somewhat similar need (though with IndexOf), I achieved a huge performance improvement with a simple loop,
as in something like
int l = url.Length;
int position = 0;
while (position < l)
{
    if (url[position] == website[0])
    {
        // test the rest of the web site from this position in another loop
        if (exactMatch(url, position, website))
        {
            // found a match
        }
    }
    position++;
}
It seems a bit crude, but in an extreme case (searching for a set of about 10 strings in a large, structured 1.2 MB file, so regex was out), I went from 3 minutes to under 1 second.
Your problem as you describe it should not involve searching for substrings at all. Split your source file up into lines (or read it in line by line) which you already know will each contain a URL, and run it through some function to extract the domain name, then compare this with some fast access tally of your target domains such as a Dictionary<string, int>, incrementing as you go, e.g.:
var source = Enumerable.Range(0, 300000).Select(x => Guid.NewGuid().ToString()).Select(x => x.Substring(0, 4) + ".com/" + x.Substring(4, 10));
var targets = Enumerable.Range(0, 900).Select(x => Guid.NewGuid().ToString().Substring(0, 4) + ".com").Distinct();
var tally = targets.ToDictionary(x => x, x => 0);
Func<string, string> naiveDomainExtractor = x=> x.Split('/')[0];
foreach(var line in source)
{
var domain = naiveDomainExtractor(line);
if(tally.ContainsKey(domain)) tally[domain]++;
}
...which takes a third of a second on my not particularly speedy machine, including generation of test data.
Admittedly your domain extractor may be a bit more sophisticated, but it will probably not be very processor intensive, and if you've got multiple cores at your disposal you can speed things up further by using a ConcurrentDictionary<string, int> and Parallel.ForEach.
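For example, a hedged sketch of that parallel variant, reusing source, targets and naiveDomainExtractor from the snippet above (requires System.Collections.Concurrent and System.Threading.Tasks):
var parallelTally = new ConcurrentDictionary<string, int>(
    targets.Select(t => new KeyValuePair<string, int>(t, 0)));
Parallel.ForEach(source, line =>
{
    var domain = naiveDomainExtractor(line);
    // AddOrUpdate increments the counter atomically across threads.
    if (parallelTally.ContainsKey(domain))
        parallelTally.AddOrUpdate(domain, 1, (key, count) => count + 1);
});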
You'd have to test the performance but you might try converting the urls to the actual System.Uri object.
Store the list of websites as a HashSet<string> - then use the HashSet to look up the Uri's Host:
IEnumerable<Uri> inputUrls = File.ReadAllLines(@"c:\myFile.txt").Select(e => new Uri(e));
string[] myUrls = new[] { "amazon.com", "google.com", "stackoverflow.com" };
HashSet<string> urls = new HashSet<string>(myUrls);
IEnumerable<Uri> matches = inputUrls.Where(e => urls.Contains(e.Host));
I have to write a program in C# (Windows Forms) which has to load from a file that looks something like this:
100ACTGGCTTACACTAATCAAG
101TTAAGGCACAGAAGTTTCCA
102ATGGTATAAACCAGAAGTCT
...
120GCATCAGTACGTACCCGTAC
20 lines, each formed from a number (ID) and 20 letters (DNA); the other file looks like this:
TGCAACGTGTACTATGGACC
In short, this is a game where a murder has been committed and there are 20 people; I have to load and split the letters, compare them, and in the end find the best match.
I have no idea how to do that: I don't know how to load the letters into an array, nor how to split them and then compare them.
What you want to do here, is use something like a calculation of the Levenshtein distance between the strings.
In simple terms, that provides a count of how many single letters you have to change for a string to become equal to another. In the context of DNA or Proteins, this can be interpreted as representing the number of mutations between two individuals or samples. A shorter distance will therefore indicate a closer relationship between the two.
The algorithm can be fairly heavy computationally, but will give you a good answer. It's also quite fun and enlightening to implement. You can find a couple of ways of implementing it under the wikipedia article.
If you find it challenging to understand how it works, I recommend you set up an example grid by hand, with one short string horizontally along the top, and one vertically along the left side, and try going through the calculations manually, just to understand the concept properly (it can be confusing at first, but is really not that difficult).
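If you want a concrete starting point, here is one common two-row dynamic-programming implementation (a sketch, not tied to any particular library):
static int Levenshtein(string a, string b)
{
    var prev = new int[b.Length + 1];
    var curr = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) prev[j] = j;
    for (int i = 1; i <= a.Length; i++)
    {
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            curr[j] = Math.Min(Math.Min(curr[j - 1] + 1,   // insertion
                                        prev[j] + 1),      // deletion
                               prev[j - 1] + cost);        // substitution
        }
        (prev, curr) = (curr, prev);                        // roll the rows
    }
    return prev[b.Length];
}
For the game, you would compute this distance between the evidence string and each suspect's DNA (skipping the numeric ID prefix) and pick the suspect with the smallest distance.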
This is a simple match function. It might not be of the complexity your game requires. This solution does not require an explicit split on the strings in order to get an array of DNA "letters". The DNA is compared in place.
Compare each "suspect" entry to the "evidence one.
int idLength = 3;
string evidence = //read from file
List<string> suspects = //read from file
List<double> matchScores = new List<double>();
foreach (string suspect in suspects)
{
    int count = 0;
    for (int i = 0; i < evidence.Length; i++)
    {
        // offset by idLength to skip the numeric ID at the start of each suspect line
        if (suspect[i + idLength] == evidence[i]) count++;
    }
    matchScores.Add(count * 100.0 / evidence.Length);
}
The matchScores list now contains all the individual match scores. I did not save the maximum match score in a separate variable as there can be several "suspects" with the same score. To find out which suspect has the best match, just iterate the matchScores list. The index of the best match is the index of the suspect in the suspects list.
Optimization notes:
you could check each "suspect" string to see where (i.e. at what index does) the DNA sequence starts, as it could be variable;
a dictionary could be used here, instead of two lists, with the "suspect string" as key and the match score as value
I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is checking a user input against a certain column in a table for all these records and return any matches with a certain hamming distance (again, this question's not about determining hamming distance, etc.). The purpose, of course, is to create a "did you mean" feature, where a user searches a name, and if no direct matches are found in the database, a list of possible matches are returned.
I'm trying to come up with a way to do all of these checks in the most reasonable runtime possible. How can I check a user's input against all of these records in the most efficient way possible?
The feature is currently implemented, but the runtime is exceedingly slow. The way it works now is it loads all records from a user-specified table (or tables) into memory and then performs the check.
For what it's worth, I'm using NHibernate for data access.
I would appreciate any feedback on how I can do this or what my options are.
Calculating Levenshtein distance doesn't have to be as costly as you might think. The code in the Norvig article can be thought of as pseudocode to help the reader understand the algorithm. A much more efficient implementation (in my case, approx 300 times faster on a 20,000 term data set) is to walk a trie. The performance difference is mostly attributed to removing the need to allocate millions of strings in order to do dictionary lookups, spending much less time in the GC, and you also get better locality of reference so have fewer CPU cache misses. With this approach I am able to do lookups in around 2ms on my web server. An added bonus is the ability to return all results that start with the provided string easily.
The downside is that creating the trie is slow (can take a second or so), so if the source data changes regularly then you need to decide whether to rebuild the whole thing or apply deltas. At any rate, you want to reuse the structure as much as possible once it's built.
As Darcara said, a BK-Tree is a good first take. They are very easy to implement. There are several free implementations easily found via Google, but a better introduction to the algorithm can be found here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.
Unfortunately, calculating the Levenshtein distance is pretty costly, and you'll be doing it a lot if you're using a BK-Tree with a large dictionary. For better performance, you might consider Levenshtein Automata. A bit harder to implement, but also more efficient, and they can be used to solve your problem. The same awesome blogger has the details: http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata. This paper might also be interesting: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652.
I guess the Levenshtein distance is more useful here than the Hamming distance.
Let's take an example: We take the word example and restrict ourselves to a Levenshtein distance of 1. Then we can enumerate all possible misspellings that exist:
1 insertion (208)
aexample
bexample
cexample
...
examplex
exampley
examplez
1 deletion (7)
xample
eample
exmple
...
exampl
1 substitution (182)
axample
bxample
cxample
...
examplz
You could store each misspelling in the database, and link that to the correct spelling, example. That works and would be quite fast, but creates a huge database.
Notice how most misspellings occur by doing the same operation with a different character:
1 insertion (8)
?example
e?xample
ex?ample
exa?mple
exam?ple
examp?le
exampl?e
example?
1 deletion (7)
xample
eample
exmple
exaple
examle
exampe
exampl
1 substitution (7)
?xample
e?ample
ex?mple
exa?ple
exam?le
examp?e
exampl?
That looks quite manageable. You could generate all these "hints" for each word and store them in the database. When the user enters a word, generate all "hints" from that and query the database.
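A small sketch of generating those hints for a single word (the Hints name is just for illustration):
static IEnumerable<string> Hints(string word)
{
    for (int i = 0; i < word.Length; i++)
    {
        yield return word.Remove(i, 1);                 // deletion hints:     "xample", "eample", ...
        yield return word.Remove(i, 1).Insert(i, "?");  // substitution hints: "?xample", "e?ample", ...
    }
    for (int i = 0; i <= word.Length; i++)
        yield return word.Insert(i, "?");               // insertion hints:    "?example", "e?xample", ...
}
You would store Hints(word) for every dictionary word, and at query time build the WHERE clause from Hints(userInput).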
Example: User enters exaple (notice missing m).
SELECT DISTINCT word
FROM dictionary
WHERE hint = '?exaple'
OR hint = 'e?xaple'
OR hint = 'ex?aple'
OR hint = 'exa?ple'
OR hint = 'exap?le'
OR hint = 'exapl?e'
OR hint = 'exaple?'
OR hint = 'xaple'
OR hint = 'eaple'
OR hint = 'exple'
OR hint = 'exale'
OR hint = 'exape'
OR hint = 'exapl'
OR hint = '?xaple'
OR hint = 'e?aple'
OR hint = 'ex?ple'
OR hint = 'exa?le'
OR hint = 'exap?e'
OR hint = 'exapl?'
exaple with 1 insertion == exa?ple == example with 1 substitution
See also: How does the Google “Did you mean?” Algorithm work?
it loads all records from a user-specified table (or tables) into memory and then performs the check
don't do that
Either do the match on the back end and only return the results you need, or cache the records into memory early on, take the working-set hit, and do the check when you need it.
You will need to structure your data differently than a database can. Build a custom search tree, with all dictionary data needed, on the client. Although memory might become a problem if the dictionary is extremely big, the search itself will be very fast. O(n log n) if I recall correctly.
Have a look at BK-Trees
Also, instead of using the Hamming distance, consider the Levenshtein distance
Building on the answer you marked as correct: note that when I say "dictionary" in this post, I mean a hash map (basically a Python dictionary).
Another way you can improve its performance is by creating an inverted index of words.
So rather than calculating the edit distance against the whole DB, you create 26 dictionaries, one keyed by each letter of the English alphabet, so the keys are "a", "b", ..., "z".
So assume you have the word "apple" in your DB:
in the "a" dictionary you add the word "apple";
in the "p" dictionary you add the word "apple";
in the "l" dictionary you add the word "apple";
in the "e" dictionary you add the word "apple".
Do this for all the words in the dictionary.
Now when a misspelled word is entered,
let's say "aplse",
you start with "a" and retrieve all the words in "a",
then you take "p" and find the intersection of the words under "a" and "p",
then "l" and find the intersection of the words under "a", "p" and "l",
and you do this for all the letters.
In the end you will have just the bunch of words which are made of the letters "a", "p", "l", "s", "e".
In the next step, you calculate the edit distance between the input word and the bunch of words returned by the above steps, thus drastically reducing your run time. A rough sketch of this follows below.
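A rough C# sketch of that inverted index (the answer is phrased in Python terms, but the rest of this page is C#); dictionaryWords stands in for your word list:
// Build once: letter -> set of dictionary words containing that letter.
var index = new Dictionary<char, HashSet<string>>();
foreach (var word in dictionaryWords)
{
    foreach (var c in word.Distinct())
    {
        if (!index.TryGetValue(c, out var set))
            index[c] = set = new HashSet<string>();
        set.Add(word);
    }
}
// Query: intersect the sets for every distinct letter of the misspelled input.
IEnumerable<string> Candidates(string input)
{
    HashSet<string> result = null;
    foreach (var c in input.Distinct())
    {
        if (!index.TryGetValue(c, out var set)) return new HashSet<string>();
        result = result == null
            ? new HashSet<string>(set)
            : new HashSet<string>(result.Intersect(set));
    }
    return result ?? new HashSet<string>();
}
// Then run the edit-distance check only against Candidates("aplse").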
Now there might be a case where nothing is returned,
for something like "aklse": there is a good chance that no word is made of just these letters.
In this case, you will have to start relaxing the above steps to a stage where you have a finite number of words left.
So, something like: start with *klse (the intersection of the words under k, l, s, e), num(wordsReturned) = k1,
then a*lse (the intersection of the words under a, l, s, e), numWords = k2,
and so on.
Choose the one which has the higher number of words returned. In this case there is really no single answer, as a lot of words might have the same edit distance; you can just say that if the edit distance is greater than "k" then there is no good match.
There are many sophisticated algorithms built on top of this,
like using statistical inference after these steps (the probability that the word is "apple" when the input is "aplse", and so on). Then you go the machine-learning way :)