I am trying to understand "Jaccard similarity" between 2 arrays of type double having values greater than zero and less than one.
So far I have searched many websites for this, but all I found is that both arrays should be the same size (the number of elements in array 1 should equal the number of elements in array 2). However, I have a different number of elements in the two arrays. Is there any way to implement "Jaccard similarity"?
Using C#'s LINQ ...
Say you have an array of doubles named A and another named B. This will give you the Jaccard index:
var CommonNumbers = from a in A.AsEnumerable<double>()
                    join b in B.AsEnumerable<double>() on a equals b
                    select a;

double JaccardIndex = ((double) CommonNumbers.Count()) /
                      ((double) (A.Count() + B.Count()));
The first statement gets a list of numbers that appear in both arrays. The second computes the index - that is just the size of the intersection (how many numbers appear in both arrays) divided by the size of the union (size, or rather count, of the one array plus the count of the other).
Sorry for necroposting, but the answer above was marked as the correct one. The Jaccard similarity coefficient from @AgapwIesu's answer can be at most 0.5 when the collections are fully identical. At a minimum, you need to multiply the numerator by 2 to normalize it, like this:
var CommonNumbers = from a in A.AsEnumerable<double>()
                    join b in B.AsEnumerable<double>() on a equals b
                    select a;

double JaccardIndex = 2 * (((double) CommonNumbers.Count()) /
                           ((double) (A.Count() + B.Count())));
Please note that this similarity coefficient is not intersection divided by union as defined at Wikipedia. If you want to get intersection divided by union using LINQ, you can try this code:
private static double JaccardIndex(IEnumerable<double> A, IEnumerable<double> B)
{
    return (double)A.Intersect(B).Count() / (double)A.Union(B).Count();
}
Take into account that Union and Intersect work with unique objects, so you should be careful when working with non-unique collections:
List<int> A = new List<int>() { 1, 1, 1, 1 };
List<int> B = new List<int>() { 1, 1, 1, 1 };
Console.WriteLine(A.Union(B).Count()); // = 1, not 4
Console.WriteLine(A.Intersect(B).Count()); // = 1, not 4
Jaccard similarity is the size of the intersection between two sets divided by the size of the union. In your case, you'd have to write the code to find out how many elements appear in both arrays, then divide that by the number of distinct elements that appear in either array (the union).
The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method _Probability() is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
    var _map = new List<Tuple<char, char, char, char>>();
    for (int i = 0; i < src.Length - 3; i++)
    {
        int j = i + 1;
        int k = i + 2;
        int l = i + 3;
        _map.Add(new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
    }
    return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
    var subset_x0 = map.Where(x => x.Item1 == x0);
    var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
    var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
    var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);

    int count_of_x0 = subset_x0.Count();
    int count_of_x1_following = subset_x0_x1_following.Count();
    int count_of_x2_following = subset_x0_x2_following.Count();
    int count_of_x3_following = subset_x0_x3_following.Count();

    decimal p1;
    decimal p2;
    decimal p3;

    if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
    {
        p1 = e;
        p2 = e;
        p3 = e;
    }
    else
    {
        p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
        p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
        p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
        p1 = (p1 * 100) + e;
        p2 = (p2 * 100) + e;
        p3 = (p3 * 100) + e;
    }

    //more calculations omitted
    return _final;
    }
}
EDIT - I'm providing more details to clear things up,
1) Strictly speaking I've only worked with English so far, but it's true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to what's described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total number of 4-tuples in the text.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionately low scores. Emphasizing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple and then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.
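For reference, the tallying approach boils down to something like this (a simplified sketch, not my exact code; the names are illustrative):
var counts = new Dictionary<Tuple<char, char, char, char>, int>();
for (int i = 0; i < src.Length - 3; i++)
{
    // Tally each 4-character window instead of storing every occurrence separately;
    // Tuple<> has structural equality, so it works as a dictionary key.
    var key = Tuple.Create(src[i], src[i + 1], src[i + 2], src[i + 3]);
    counts[key] = counts.TryGetValue(key, out int c) ? c + 1 : 1;
}
// A probability lookup is then a single dictionary access rather than a scan of the whole list.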
In your probability method you are iterating the map 8 times. Each of your Wheres iterates the entire list, and so does the Count. Adding a .ToList() at the end would (potentially) speed things up. That said, I think your main problem is that the structure you've chosen to store the data in is not suited to the purpose of the probability method. You could create a one-pass version where the structure you store your data in calculates the tentative distribution on insert. That way, when you're done with the insert (which shouldn't be slowed down too much) you're done, or, as in the code below, you can do a cheap calculation of the probability when you need it.
As an aside, you might want to take punctuation and whitespace into account. The first letter/word of a sentence and the first letter of a word give a clear indication of what language a given text is written in; by taking punctuation characters and whitespace as part of your distribution you include those characteristics of the sample data. We did that some years back. Doing that, we showed that using just three characters was almost as exact as using more (we tested up to 7); we had no failures with three on our test data, and "almost as exact" is an assumption given that there must be some weird text where the lack of information would yield an incorrect result. But the speed of three letters made that the best case.
EDIT
Here's an example of how I think I would do it in C#
class TextParser {
    private Node Parse(string src) {
        var top = new Node(null);
        for (int i = 0; i < src.Length - 3; i++) {
            var first = src[i];
            var second = src[i + 1];
            var third = src[i + 2];
            var fourth = src[i + 3];
            var firstLevelNode = top.AddChild(first);
            var secondLevelNode = firstLevelNode.AddChild(second);
            var thirdLevelNode = secondLevelNode.AddChild(third);
            thirdLevelNode.AddChild(fourth);
        }
        return top;
    }
}

public class Node {
    private readonly Node _parent;
    private readonly Dictionary<char, Node> _children
        = new Dictionary<char, Node>();
    private int _count;

    public Node(Node parent) {
        _parent = parent;
    }

    public Node AddChild(char value) {
        if (!_children.ContainsKey(value)) {
            _children.Add(value, new Node(this));
        }
        var levelNode = _children[value];
        levelNode._count++;
        return levelNode;
    }

    public decimal Probability(string substring) {
        var node = this;
        foreach (var c in substring) {
            if (!node.Contains(c))
                return 0m;
            node = node[c];
        }
        // Divide this node's count by the total count of all continuations
        // seen from the same parent, giving a relative frequency.
        var total = 0;
        foreach (var sibling in node._parent._children.Values)
            total += sibling._count;
        return ((decimal)node._count) / total;
    }

    public Node this[char value] {
        get { return _children[value]; }
    }

    private bool Contains(char c) {
        return _children.ContainsKey(c);
    }
}
the usage would then be:
var top = Parse(src);
top.Probability("test");
I would suggest changing the data structure to make that faster...
I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient since you would be accessing each "level" (Item1...Item4) when calculating... and you would cache the result in the innermost Dictionary so next time you don't have to calculate at all..
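Roughly, filling such a structure in one pass could look like this (an untested sketch; I'm storing raw counts here, from which the cached probabilities you mention could be derived):
var counts = new Dictionary<char, Dictionary<char, Dictionary<char, Dictionary<char, int>>>>();
for (int i = 0; i < src.Length - 3; i++)
{
    // Walk (or create) one level per character of the 4-character window.
    if (!counts.TryGetValue(src[i], out var level2))
        counts[src[i]] = level2 = new Dictionary<char, Dictionary<char, Dictionary<char, int>>>();
    if (!level2.TryGetValue(src[i + 1], out var level3))
        level2[src[i + 1]] = level3 = new Dictionary<char, Dictionary<char, int>>();
    if (!level3.TryGetValue(src[i + 2], out var level4))
        level3[src[i + 2]] = level4 = new Dictionary<char, int>();
    level4[src[i + 3]] = level4.TryGetValue(src[i + 3], out int n) ? n + 1 : 1;
}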
Ok, I don't have time to work out the details, but this really calls for
neural classifier nets (just take any off the shelf, even the Controllable Regex Mutilator would do the job with way more scalability) -- heuristics over brute force
you could use tries (Patricia tries, a.k.a. radix trees) to make a space-optimized version of your data structure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)
There's not much you can do with the parse function as it stands. However, the tuples appear to be four consecutive characters from a large body of text. Why not just replace the tuple with an int and then use the int to index the large body of text when you need the character values? Your tuple-based method is effectively consuming four times the memory the original text would use, and since memory is usually the bottleneck to performance, it's best to use as little as possible.
You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the LINQ statements you're using? The .Where will be doing memory allocation (which is a slow operation) and the LINQ statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
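As a very rough sketch of that linear search (illustrative only), counting the occurrences of a short pattern directly in the source string avoids building any intermediate collections:
static int CountOccurrences(string src, string pattern)
{
    int count = 0;
    for (int i = 0; i + pattern.Length <= src.Length; i++)
    {
        bool match = true;
        for (int j = 0; j < pattern.Length; j++)
        {
            if (src[i + j] != pattern[j]) { match = false; break; }
        }
        if (match) count++;
    }
    return count;
}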
But then, as has been mentioned in the comments, using a 26^4 matrix would be the most efficient. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:
SortedDictionary <int,int> count_of_single_letters; // key = single character
SortedDictionary <int,int> count_of_double_letters; // key = char1 + char2 * 32
SortedDictionary <int,int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
SortedDictionary <int,int> count_of_quad_letters; // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
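Something along these lines could fill them in a single pass (just a sketch; it assumes the characters have already been reduced to values 0..31, e.g. lowercase 'a'..'z', so the *32 packing above works, and Increment is a small helper of my own, not an existing API):
for (int i = 0; i < src.Length - 3; i++)
{
    int c1 = src[i] - 'a', c2 = src[i + 1] - 'a', c3 = src[i + 2] - 'a', c4 = src[i + 3] - 'a';
    Increment(count_of_single_letters, c1);
    Increment(count_of_double_letters, c1 + c2 * 32);
    Increment(count_of_triple_letters, c1 + c2 * 32 + c3 * 32 * 32);
    Increment(count_of_quad_letters, c1 + c2 * 32 + c3 * 32 * 32 + c4 * 32 * 32 * 32);
}

static void Increment(SortedDictionary<int, int> d, int key)
{
    // Add 1 to the count for this key, creating the entry if needed.
    d[key] = d.TryGetValue(key, out int n) ? n + 1 : 1;
}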
Finally, a note on data types. You're using the decimal type. This is not an efficient type, as there is no direct mapping to a CPU-native type and there is overhead in processing the data. Use a double instead; I think the precision will be sufficient. The most precise way would be to store the probability as two integers, the numerator and denominator, and then do the division as late as possible.
The best approach here is to use sparse storage and pruning, for example after every 10000 characters. The best storage structure in this case is a prefix tree; it allows fast probability calculation, updating, and sparse storage. You can find more theory in this javadoc: http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html
This is fairly 'math-y', but I'm posting here because it's a Project Euler problem, and I have working code that presumably has bugs in it.
The question Determining longest repeating cycle in a decimal expansion solves the problem using logarithms, but I'm interested in solving with simple brute force. More accurately, I'm interested in understanding why my algorithm and code are not returning the correct solution.
The algorithm is simple:
replicate a 'long division',
at each step record the divisor and the remainder
when a divisor / remainder tuple is repeated, infer that the decimal representation will repeat.
Here are private fields, as requested
private int numerator;
private int recurrence;
private int result;
private int resultRecurrence;
private List<dynamic> digits;
and here is the code:
private void Go()
{
    foreach (var i in primes)
    {
        digits = new List<dynamic>();
        numerator = 1;
        recurrence = 0;

        while (numerator != 0)
        {
            numerator *= 10;

            // quotient
            var q = numerator / i;

            // remainder
            var r = numerator % i;

            digits.Add(new { Divisor = q, Remainder = r });

            // if we've found a repetition then break out
            var m = digits.Where(p => p.Divisor == q && p.Remainder == r).ToList();
            if (m.Count > 1)
            {
                recurrence = digits.LastIndexOf(m[0]) - digits.IndexOf(m[0]);
                break;
            }

            numerator = r;
        }

        if (recurrence > resultRecurrence)
        {
            resultRecurrence = recurrence;
            result = i;
        }
    }
}
When testing integers < 10 and < 20 I get the correct result, and I correctly identify the value of i as well. However the decimal representation that I get is incorrect - I calculate i-1 whereas the correct result is far less (something like i-250).
So presumably I either have a programming bug - which I can't find - or a logic bug.
I'm confused because it feels like a multiplicative group over p to me, in which there would be p-1 elements. I'm sure I'm missing something, can anyone provide suggestions?
edit
I'm not going to include my prime number code - it's not relevant. As I explain above, I correctly identify the value of i (from memory it is 983), but I'm having problems getting the correct value for resultRecurrence.
I'm confused because it feels like a multiplicative group over p to me, in which there would be p-1 elements. I'm sure I'm missing something, can anyone provide suggestions?
Close.
For all primes except 2 and 5 (which divide 10), the sequence of remainders is formed by starting with 1 and transforming by
remainder = (10 * remainder) % prime
thus the k-th remainder is 10^k (mod prime) and the set of remainders forms a subgroup of the group of nonzero remainders modulo prime[1]. The length of the recurring cycle is the order of that subgroup, which is also known as the order of 10 modulo prime.
The order of the group of nonzero remainders modulo prime is prime-1, and there's a theorem by Lagrange:
Let G be a finite group of order g and H be a subgroup of G. Then the order h of H divides g.
So the length of the cycle is always a divisor of prime-1, and sometimes it's prime-1, e.g. for 7 or 19.
[1] For composite numbers n coprime to 10, that would be the group of remainders modulo n that are coprime to n.
First off, you don’t need the divisors, you only need the remainders.
Secondly, I would split the function into multiple independent parts instead of having everything in one big method: The long division / finding of the cycle length is independent of the rest (= finding the longest cycle).
Your break on Where coupled with Count is unintuitive. Why not just use a while loop with the condition (! digits.Contains(r))? (This would require putting 0 as a remainder into the digits list before the loop start.)
This leaves us with much cleaner code that should be straightforward to debug.
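A sketch of what I have in mind (my own illustration, slightly different from the 0-seeding trick above, and not tested against the original program):
private static int CycleLength(int divisor)
{
    // Track only the remainders of the long division of 1 by divisor and stop as
    // soon as a remainder repeats (or the division terminates with remainder 0).
    var remainders = new List<int>();
    int remainder = 1;
    while (remainder != 0 && !remainders.Contains(remainder))
    {
        remainders.Add(remainder);
        remainder = (remainder * 10) % divisor;
    }
    if (remainder == 0)
        return 0; // terminating decimal, no recurring cycle
    // The cycle length is the distance back to the first occurrence of the repeated remainder.
    return remainders.Count - remainders.IndexOf(remainder);
}
The longest-cycle search then just calls this for each prime and keeps the maximum.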
recurrence = digits.LastIndexOf(m[0]) - digits.IndexOf(m[0]);
Surely the value of resultRecurrence is always going to be i-1 ? Since for a fraction of the form 1/n, the decimal starts repeating exactly when the division-in-progress (the ith digit) gives the same quotient-remainder as the very first trial division (1, hence i-1).
(as a side note, may I introduce you to Math.DivRem).
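For example, Math.DivRem returns the quotient and produces the remainder in one call:
int q = Math.DivRem(numerator, i, out int r); // q = numerator / i, r = numerator % i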
I have some random integers like
99 20 30 1 100 400 5 10
I have to find a sum from any combination of these integers that is closest (equal or more, but not less) to a given number like
183
What is the fastest and most accurate way of doing this?
If your numbers are small, you can use a simple Dynamic Programming(DP) technique. Don't let this name scare you. The technique is fairly understandable. Basically you break the larger problem into subproblems.
Here we define the problem to be can[number]. If the number can be constructed from the integers in your file, then can[number] is true, otherwise it is false. It is obvious that 0 is constructable by not using any numbers at all, so can[0] is true. Now you try to use every number from the input file. We try to see if the sum j is achievable. If an already achieved sum + current number we try == j, then j is clearly achievable. If you want to keep track of what numbers made a particular sum, use an additional prev array, which stores the last used number to make the sum. See the code below for an implementation of this idea:
// numbers[] holds the integers from the input, N = numbers.Length, SUM = the target value
int UPPER_BOUND = numbers.Sum(); //The largest number you can construct
bool[] can = new bool[UPPER_BOUND + 1]; //can[number] is true if number can be constructed
can[0] = true; //0 is always achievable by not using any number
int[] prev = new int[UPPER_BOUND + 1]; //prev[number] is the last number used to achieve sum "number"

for (int i = 0; i < N; i++) //Try to use every number (numbers[i]) from the input file
{
    for (int j = UPPER_BOUND; j >= 1; j--) //Try to see if j is an achievable sum
    {
        if (can[j]) continue; //It is already an achieved sum, so go to the next j
        if (j - numbers[i] >= 0 && can[j - numbers[i]]) //If an (already achievable sum) + (numbers[i]) == j, then j is obviously achievable
        {
            can[j] = true;
            prev[j] = numbers[i]; //To achieve j we used numbers[i]
        }
    }
}

int CLOSEST_SUM = -1;
for (int i = SUM; i <= UPPER_BOUND; i++)
    if (can[i])
    {
        //the closest number to SUM (equal to or larger than SUM) is i
        CLOSEST_SUM = i;
        break;
    }

int currentSum = CLOSEST_SUM;
do
{
    int usedNumber = prev[currentSum];
    Console.WriteLine(usedNumber);
    currentSum -= usedNumber;
} while (currentSum > 0);
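For the numbers in the question (99, 20, 30, 1, 100, 400, 5, 10) and SUM = 183, the closest achievable sum is 199 (100 + 99), so the loop at the end should print 100 and then 99.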
This seems to be a Knapsack-like problem, where the value of your integers would be the "weight" of each item, the "profit" of each item is 1, and you are looking for the least number of items to exactly sum to the maximum allowable weight of the knapsack.
This is a variant of the SUBSET-SUM problem, and is also NP-Hard like SUBSET-SUM.
But if the numbers involved are small, pseudo-polynomial time algorithms exist. Check out:
http://en.wikipedia.org/wiki/Subset_sum_problem
OK, more details.
The following problem is NP-Hard:
Given an array of integers and integers a, b, is there some subset whose sum lies in the interval [a, b]?
This is so because we can solve subset-sum by choosing a=b=0.
Now this problem easily reduces to your problem and so your problem is NP-Hard too.
Now you can use the polynomial time approximation algorithm mentioned in the wiki link above.
Given an array of N integers, a target S and an approximation threshold c,
there is a polynomial time approximation algorithm (involving 1/c) which tells if there is a subset sum in the interval [(1-c)S, S].
You can use this repeatedly (by some form of binary search) to find the best approximation to S you need. Note you can also use this on intervals of the form [S, (1+c)S], while the knapsack will only give you a solution <= S.
Of course there might be better algorithms, in fact I can bet on it. There should be plenty of literature on the web. Some search terms you can use: approximation algorithms for subset-sum, pseudo-polynomial time algorithms, dynamic programming algorithm etc.
A simple brute-force method would be to read the text in, parse it into numbers, and then go through all combinations until you find the required sum.
A quicker solution would be to sort the numbers, then...
Add the largest number to your sum. Is it too big? If so, take it off and try the next smallest.
If the sum is too small, add the next largest number and repeat.
Continue adding numbers without letting the sum exceed the target. Finish when you hit the target.
Note that when you backtrack, you may need to back track more than one level. Sounds like a good case for recursion...
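A rough recursive sketch of that idea (illustrative only; there is no pruning beyond stopping once the target is reached, and all numbers are assumed positive):
static int Closest(int[] sorted, int index, int sum, int target, int best)
{
    if (sum >= target)
        return Math.Min(best, sum); // reached or passed the target; adding more only makes it worse
    for (int i = index; i < sorted.Length; i++)
        best = Closest(sorted, i + 1, sum + sorted[i], target, best);
    return best;
}

// Usage for the numbers in the question:
int[] nums = { 400, 100, 99, 30, 20, 10, 5, 1 }; // sorted descending
int answer = Closest(nums, 0, 0, 183, int.MaxValue); // 199 (100 + 99)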
If the numbers are large you can turn this into an integer programme. Using Mathematica's solver, it might look something like this:
nums = {99, 20, 30 , 1, 100, 400, 5, 10};
vars = a /@ Range@Length@nums;
Minimize[(vars.nums - 183)^2, vars, Integers]
You can sort the list of values, find the first value that's greater than the target, and start concentrating on the values that are less than the target. Find the sum that's closest to the target without going over, then compare that to the first value greater than the target. If the difference between the closest sum and the target is less than the difference between the first value greater than the target and the target, then you have the sum that's closest.
Kinda hokey, but I think the logic hangs together.
I was wondering how it is possible to generate a 512-bit (155 decimal digit) prime number whose last five decimal digits are specified/fixed (e.g. ***28071)?
The principles of generating primes without any extra specifications are quite understandable, but my case goes further.
Any hints for, at least, where should I start?
Java or C# is preferable.
Thanks!
I guess the only way would be to first generate a random number of 150 decimal digits, then append the 28071 behind it by doing number = randomnumber * 100000 + 28071, and then just brute-force it out with something like
while (!IsPrime(number))
number += 100000;
Of course this could take awhile to compute ;-)
Did you try just generating such numbers and checking them? I would expect that to be acceptably fast. The prime density decreases only as the logarithm of the number, so I'd expect you to try a few hundred numbers until you hit a prime. ln(2^512) = 354 so about one number in 350 will be prime.
Roughly speaking, the prime number theorem states that if a random number nearby some large number N is selected, the chance of it being prime is about 1 / ln(N), where ln(N) denotes the natural logarithm of N. For example, near N = 10,000, about one in nine numbers is prime, whereas near N = 1,000,000,000, only one in every 21 numbers is prime. In other words, the average gap between prime numbers near N is roughly ln(N)
(from http://en.wikipedia.org/wiki/Prime_number_theorem)
You just need to take care that a number exists for your final digits. But I think that's as easy as checking that the last digit isn't divisible by 2 or 5 (i.e. it is 1, 3, 7 or 9).
According to this performance data you can do about 2000 ModPow operations on 512-bit data per second, and since a simple primality test checks 2^(p-1) mod p = 1, which is one ModPow operation, you should be able to generate several primes with your properties per second.
So you could do (pseudocode):
BigInteger FindPrimeCandidate(int lastDigits)
{
    BigInteger i = Random512BitInt;
    int remainder = i % 100000;
    int increment = lastDigits - remainder;
    i += increment;
    BigInteger test = BigInteger.ModPow(2, i - 1, i);
    if (test == 1)
        return i;
    else
        return null;
}
And do more extensive prime checks on the result of that function.
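Putting the pieces together, a rough C# sketch along those lines might be (untested; Random512Bit is a made-up helper and the Fermat check is only a first filter, to be followed by a proper Miller-Rabin test):
using System;
using System.Numerics;
using System.Security.Cryptography;

static BigInteger Random512Bit()
{
    var bytes = new byte[64]; // 64 bytes = 512 bits, little-endian for BigInteger
    using (var rng = RandomNumberGenerator.Create())
        rng.GetBytes(bytes);
    bytes[63] &= 0x7F; // keep the value positive
    bytes[63] |= 0x40; // force a high bit so the number stays close to 512 bits
    return new BigInteger(bytes);
}

static BigInteger FindPrimeWithSuffix(int lastDigits)
{
    BigInteger n = Random512Bit();
    n += lastDigits - (int)(n % 100000); // force the last five decimal digits
    while (BigInteger.ModPow(2, n - 1, n) != 1) // cheap Fermat test to base 2
        n += 100000; // stepping by 100000 preserves the suffix
    return n; // candidate only; still needs a stronger primality test
}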
As @Doggot said, but start from the least possible 150 digit number which ends with 28071, i.e. 100000....0028071, then add 100000 to it each time, and for testing primality use Miller-Rabin like the code I provided here; it needs some customization. If the return value is true, check it for exact primality.
You can use a sieve which contains only numbers satisfying your special condition to filter out numbers divisible by small primes.
For each small prime p you need to find the correct starting point and step, taking into account that only every 100000th number is present in the sieve.
For the numbers that survive the sieve you can use BigInteger.isProbablePrime() to check whether it is prime with sufficient probability.
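A compressed sketch of that sieve (my own illustration; baseValue is assumed to already end in the required five digits, smallPrimes is a list of small odd primes other than 5, and IsProbablePrime stands in for the strong test):
const int step = 100000;
int limit = 10000; // how many candidates baseValue + k*step to sieve
var composite = new bool[limit];
foreach (int p in smallPrimes)
{
    // baseValue + k*step == 0 (mod p)  =>  k == -baseValue * step^(-1) (mod p)
    int stepInv = (int)BigInteger.ModPow(step % p, p - 2, p); // modular inverse; valid since p is prime and p != 2, 5
    int k0 = (int)((long)((p - (int)(baseValue % p)) % p) * stepInv % p);
    for (int k = k0; k < limit; k += p)
        composite[k] = true;
}
for (int k = 0; k < limit; k++)
    if (!composite[k] && IsProbablePrime(baseValue + k * (BigInteger)step))
        Console.WriteLine(baseValue + k * (BigInteger)step);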
Let ABCDE be the five digits number in base ten, which you are considering. Based on Dirichlet's theorem on arithmetic progressions, if ABCDE and 100000 are coprime, then there are infinitely many primes of the form 100000*k+ABCDE. Since you are looking for prime numbers, neither 2 nor 5 would divide ABCDE anyway, thus ABCDE and 100000 are coprime. So there are infinitely many primes of the form you are considering.
You could extend one of the standard methods for generating large primes by adding an extra constraint, i.e. that the last 5 decimal digits must be correct. Naively, you can just add this as an extra test, but it will increase the time to find a suitable prime by a factor of 10^5.
Not-so-naively: generate a random 512-bit number then set sufficient low-order bits so that the decimal representation ends with the required sequence. Then continue with the normal primality tests.
I rewrote the brute-force algorithm from the int world to the BigDecimal one with the help of the BigSquareRoot class from http://www.merriampark.com/bigsqrt.htm. (Note that from 1 to 1000 there are said to be exactly 168 primes.)
Sorry, but if you put in your range, i.e. <10^154; 10^155 - 1>, you can let your computer work and when you have retired, you may have the result... it is damn slow!
However, you can somehow find at least a part of this useful in combination with the other answers in this thread.
package edu.eli.test.primes;

import java.math.BigDecimal;

public class PrimeNumbersGenerator {

    public static void main(String[] args) {
        // BigDecimal lowerLimit = BigDecimal.valueOf(10).pow(154); /* 155 digits */
        // BigDecimal upperLimit = BigDecimal.valueOf(10).pow(155).subtract(BigDecimal.ONE);
        BigDecimal lowerLimit = BigDecimal.ONE;
        BigDecimal upperLimit = new BigDecimal("1000");

        BigDecimal prime = lowerLimit;
        int i = 1;

        /* http://www.merriampark.com/bigsqrt.htm */
        BigSquareRoot bsr = new BigSquareRoot();

        upperLimit = upperLimit.add(BigDecimal.ONE);
        while (prime.compareTo(upperLimit) == -1) {
            bsr.setScale(0);
            BigDecimal roundedSqrt = bsr.get(prime);

            boolean isPrimeNumber = false;
            BigDecimal upper = roundedSqrt;
            while (upper.compareTo(BigDecimal.ONE) == 1) {
                BigDecimal div = prime.remainder(upper);
                if ((prime.compareTo(upper) != 0) && (div.compareTo(BigDecimal.ZERO) == 0)) {
                    isPrimeNumber = false;
                    break;
                } else if (!isPrimeNumber) {
                    isPrimeNumber = true;
                }
                upper = upper.subtract(BigDecimal.ONE);
            }

            if (isPrimeNumber) {
                System.out.println("\n" + i + " -> " + prime + " is a prime!");
                i++;
            } else {
                System.out.print(".");
            }

            prime = prime.add(BigDecimal.ONE);
        }
    }
}
Let's consider brute-force. Take a look at this very interesting text called "The prime number lottery":
http://plus.maths.org/content/prime-number-lottery
Given the last entry in the last table, there are ~2.79*10^14 primes less than 10^16. Thus, approximately every 35th number is a prime in that range.
EDIT: See the comment by CodeInChaos - if you just walk a few thousand 512-bit numbers with the last 5 digits fixed, you'll find one quickly.