Implementing Jaccard similarity in C#

I am trying to compute the "Jaccard similarity" between two arrays of type double, with values greater than zero and less than one.
So far, every site I have found says that both arrays should be the same size (the number of elements in array 1 should equal the number of elements in array 2). But my arrays have different numbers of elements. Is there any way to implement "Jaccard similarity" in that case?

Using C#'s LINQ ...
Say you have an array of doubles named A and another named B. This will give you the Jaccard index:
var CommonNumbers = from a in A.AsEnumerable<double>()
                    join b in B.AsEnumerable<double>() on a equals b
                    select a;
double JaccardIndex = ((double) CommonNumbers.Count()) /
                      ((double) (A.Count() + B.Count()));
The first statement gets a list of numbers that appear in both arrays. The second computes the index - that is just the size of the intersection (how many numbers appear in both arrays) divided by the size of the union (size, or rather count, of the one array plus the count of the other).

Sorry for necroposting, but the answer above was marked as the correct one. The Jaccard similarity coefficient from #AgapwIesu's answer can be at most 0.5, even when the collections are fully identical. At a minimum, you need to multiply the numerator by 2 to normalize it, like this:
var CommonNumbers = from a in A.AsEnumerable<double>()
                    join b in B.AsEnumerable<double>() on a equals b
                    select a;
double JaccardIndex = 2 * (((double) CommonNumbers.Count()) /
                           ((double) (A.Count() + B.Count())));
Please note that this similarity coefficient is not the intersection divided by the union as defined on Wikipedia. If you want the intersection divided by the union using LINQ, you can try this code:
private static double JaccardIndex(IEnumerable<double> A, IEnumerable<double> B)
{
    return (double)A.Intersect(B).Count() / (double)A.Union(B).Count();
}
Take into account that Union and Intersect work with unique objects, so you should be careful when working with non-unique collections:
List<int> A = new List<int>() { 1, 1, 1, 1 };
List<int> B = new List<int>() { 1, 1, 1, 1 };
Console.WriteLine(A.Union(B).Count()); // = 1, not 4
Console.WriteLine(A.Intersect(B).Count()); // = 1, not 4
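Since Union and Intersect deduplicate, a multiset ("bag") variant of the Jaccard index has to compare element counts instead. A minimal sketch of that idea (the class and method names are placeholders, not from any answer above):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class MultisetJaccard
{
    // Multiset Jaccard: sum of per-value minimum counts (intersection)
    // divided by sum of per-value maximum counts (union).
    public static double Index<T>(IEnumerable<T> a, IEnumerable<T> b)
    {
        var ca = a.GroupBy(x => x).ToDictionary(g => g.Key, g => g.Count());
        var cb = b.GroupBy(x => x).ToDictionary(g => g.Key, g => g.Count());
        double inter = 0, union = 0;
        foreach (var k in ca.Keys.Union(cb.Keys))
        {
            int na = ca.TryGetValue(k, out var va) ? va : 0;
            int nb = cb.TryGetValue(k, out var vb) ? vb : 0;
            inter += Math.Min(na, nb);
            union += Math.Max(na, nb);
        }
        return union == 0 ? 0 : inter / union;
    }

    static void Main()
    {
        var A = new List<int> { 1, 1, 1, 1 };
        var B = new List<int> { 1, 1, 1, 1 };
        Console.WriteLine(Index(A, B)); // identical multisets -> 1, not 0.5
    }
}
```

With this variant, duplicates count, so the two four-element lists above compare as fully identical.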

Jaccard similarity is an index of the size of intersection between two sets, divided by the size of the union. In your case, you'd have to write the code to find out how many elements appear in both arrays, then divide that by the sum of the size of both arrays.


Should I use the Sum method and Count/Length to find the element of an array that is closest to the average value of all elements?

If I have arr=[1,3,4,-7,9,11], the average value is (1+3+4-7+9+11) / 6 = 3.5; elements 3 and 4 are equally distant from 3.5, but the smaller of them is 3, so 3 is the result.
You need to find the average first. That involves a loop, either implemented explicitly or invoked implicitly. So let's assume you already know the average value, since your question is about how values related to the average can be obtained. Let's implement a comparison function:
protected double isBetter(double a, double b, double avg) {
    double absA = Math.Abs(a - avg);
    double absB = Math.Abs(b - avg);
    if (absA < absB) return a;
    if (absA > absB) return b;
    return (a < b) ? a : b; // equally distant: prefer the smaller value
}
And now you can iterate over your array, always comparing the current value with the best so far via isBetter; if it's better, it becomes the new best. Whatever number ends up as the best is the result.
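The two passes just described (compute the average, then track the best-so-far value, preferring the smaller one on ties) can be sketched self-contained with the comparison inlined; the class and method names here are placeholders:

```csharp
using System;

class ClosestToAverage
{
    // Returns the element of arr closest to the average; on a tie, the smaller one.
    public static double Find(double[] arr)
    {
        double avg = 0;
        foreach (var v in arr) avg += v;   // first pass: the average
        avg /= arr.Length;

        double best = arr[0];
        foreach (var v in arr)             // second pass: keep the better candidate
        {
            double dv = Math.Abs(v - avg), db = Math.Abs(best - avg);
            if (dv < db || (dv == db && v < best)) best = v;
        }
        return best;
    }

    static void Main()
    {
        Console.WriteLine(Find(new double[] { 1, 3, 4, -7, 9, 11 })); // 3
    }
}
```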
Assuming you have worked out the average (avg below), you can compute the diff for each item, order by the diff, and take the first item. This gives you the closest item in the array:
var nearestDiff = arr.Select(x => new { Value = x, Diff = Math.Abs(avg - x) })
                     .OrderBy(x => x.Diff)
                     .First().Value;
Live example: https://dotnetfiddle.net/iKvmhp
If instead you must get the item lower than the average:
var lowerDiff = arr.Where(x => x < avg)
                   .OrderByDescending(x => x)
                   .First();
You'll need using System.Linq for either of the above to work
Using GroupBy is a good way to do it
var arr = new int[] { 1, 4, 3, -7, 9, 11 };
var avg = arr.Average();
var result = arr.GroupBy(x => Math.Abs(avg - x))
                .OrderBy(g => g.Key)
                .First()
                .OrderBy(x => x)
                .First();
Original Array
[1,4,3,-7,9,11]
After grouping, the key is the absolute distance from the average, and items are grouped accordingly (GroupBy preserves first-occurrence order):
[2.5, [1]]
[0.5, [4, 3]]
[10.5, [-7]]
[5.5, [9]]
[7.5, [11]]
Order by group keys
[0.5, [4, 3]]
[2.5, [1]]
[5.5, [9]]
[7.5, [11]]
[10.5, [-7]]
Take first group
[4, 3]
Order group items
[3, 4]
Take first item
3
(I changed the array to [1,4,3,-7,9,11], reversing the order of 3 and 4, because in the original order they were already sorted as the output requires, and the swap is needed to demonstrate the last step.)

Represent division as a sum of integers

I'm trying to figure out the name of the algorithm that represents a division operation as an array of integers that sum to the dividend. Each element of this array must be as close as possible to the actual rational result of the division. For example:
5/2 = [3,2] (each element close to 2.5)
100/3 = [34,33,33] (each element close to 33.333(3))
3/1 = [3] (each element close to 3)
It seems like a very basic manipulation. The question is just out of sheer interest: is there a common name for this operation? Maybe it's already included in every math library and I missed that fact?
Here's how I do it currently:
public IEnumerable<int> Distribute(int a, int b) {
    var div = a / b;
    var rem = a % b;
    return Enumerable.Repeat(div + 1, rem).Concat(Enumerable.Repeat(div, b - rem));
}
I think what you're looking for are the Math.Floor() and Math.Ceiling() functions. Given a decimal number, they return the nearest integer above it (Math.Ceiling()) and the nearest integer below it (Math.Floor()).

Find smallest number in given range in an array

Hi, I have an array of size N. The array values will always be the integers 1, 2, or 3 only. Now I need to find the lowest number within a given range of array indices. For example, with array = 2 1 3 1 2 3 1 3 3 2, the lowest values for ranges like [2-4] = 1, [4-5] = 2, [7-8] = 3, etc.
Below is my code :
static void Main(String[] args) {
    string[] width_temp = Console.ReadLine().Split(' ');
    int[] width = Array.ConvertAll(width_temp, Int32.Parse); // Main array
    string[] tokens_i = Console.ReadLine().Split(' ');
    int i = Convert.ToInt32(tokens_i[0]);
    int j = Convert.ToInt32(tokens_i[1]);
    int vehicle = width[i];
    for (int beg = i + 1; beg <= j; beg++) {
        if (vehicle > width[beg]) {
            vehicle = width[beg];
        }
    }
    Console.WriteLine("{0}", vehicle);
}
The above code works fine, but my concern is efficiency. Above I handle just one range, but in practice there will be n ranges and I would have to return the lowest for each. The problem is that for a range like [0-N], where N is the array size, I would end up comparing all the items. So I was wondering if there is a way to optimize the code for efficiency?
I think this is an RMQ (Range Minimum Query) problem, and there are several implementations that may fit your scenario.
There is a nice TopCoder tutorial covering a lot of them; I recommend two.
Using the notation in the tutorial, define <P, T> as <Preprocess Complexity, Query Complexity>. There are two famous and common implementations / data structures which can handle RMQ: square-rooting the array, and the segment tree.
The segment tree is famous yet hard to implement; it can solve RMQ in <O(n), O(lg n)>, though, which is better complexity than square-rooting the array (<O(n), O(sqrt(n))>).
Square-rooting the array (<O(n), O(sqrt(n))>)
Note that this is not an official name of the technique, nor of any data structure; indeed, I do not know if there is any official name for it since I learnt it... but here we go.
Its query time is definitely not the best you can get for RMQ, but it has one advantage: easy implementation! (Compared to the segment tree...)
Here is the high level concept of how it works:
Let N be the length of the array. We split the array into sqrt(N) groups, each containing sqrt(N) elements.
Now we use O(N) time to find the minimum value of each group, and store these minima in another array called M.
So, using the example array above (length 10, group size 3), M[0] = min(A[0..2]), M[1] = min(A[3..5]), M[2] = min(A[6..8]), M[3] = min(A[9..9]).
(The image in the TopCoder tutorial stores the index of the minimum element instead.)
Now let's see how to query:
For any range [p..q], we can always split it into at most 3 parts:
Two boundary parts, made of left-over elements that cannot form a whole group.
One part for the elements in between, which form whole groups.
Using the same example, RMQ(2,7) can be split into 3 parts:
Left boundary (left-over elements): A[2]
Right boundary (left-over elements): A[6], A[7]
In-between elements (whole groups): A[3], A[4], A[5]
Notice that for the in-between elements we have already preprocessed their minima into M, so we do not need to look at each element; we can compare entries of M instead, and there are at most O(sqrt(N)) of them (that is the length of M, after all).
For the boundary parts, since by definition they cannot form a whole group, there are at most O(sqrt(N)) elements in each (that is the length of one whole group, after all).
So, combining the two boundary parts with the in-between part, we only need to compare O(3*sqrt(N)) = O(sqrt(N)) elements.
You can refer to the tutorial for more details (even for some pseudo codes).
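The scheme described above can be sketched in C# as follows; this is a minimal illustration of the technique (the class and member names are my own), not tuned code:

```csharp
using System;

class SqrtRmq
{
    private readonly int[] a;    // original array
    private readonly int[] m;    // m[g] = minimum of group g
    private readonly int size;   // group length, about sqrt(n)

    public SqrtRmq(int[] values)
    {
        a = values;
        size = Math.Max(1, (int)Math.Sqrt(a.Length));
        m = new int[(a.Length + size - 1) / size];
        for (int g = 0; g < m.Length; g++)         // O(n) preprocessing
        {
            m[g] = int.MaxValue;
            for (int i = g * size; i < Math.Min(a.Length, (g + 1) * size); i++)
                m[g] = Math.Min(m[g], a[i]);
        }
    }

    // Minimum of a[p..q] inclusive, in O(sqrt(n)).
    public int Query(int p, int q)
    {
        int min = int.MaxValue, i = p;
        // left boundary: left-over elements before the next group starts
        while (i <= q && i % size != 0) min = Math.Min(min, a[i++]);
        // in-between: whole groups, answered from m
        while (i + size - 1 <= q) { min = Math.Min(min, m[i / size]); i += size; }
        // right boundary: remaining left-over elements
        while (i <= q) min = Math.Min(min, a[i++]);
        return min;
    }

    static void Main()
    {
        var rmq = new SqrtRmq(new[] { 2, 1, 3, 1, 2, 3, 1, 3, 3, 2 });
        Console.WriteLine(rmq.Query(2, 4)); // 1
        Console.WriteLine(rmq.Query(4, 5)); // 2
        Console.WriteLine(rmq.Query(7, 8)); // 3
    }
}
```

The queries in Main match the ranges from the question ([2-4] = 1, [4-5] = 2, [7-8] = 3).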
You could do this using LINQ extension methods:
List<int> numbers = new List<int> { 2, 1, 3, 1, 2, 3, 1, 3, 3, 2 };
int minindex = 1, maxindex = 3, minimum = -1;
if (minindex >= 0 && minindex <= maxindex && maxindex < numbers.Count())
{
    minimum = Enumerable.Range(minindex, maxindex - minindex + 1) // max inclusive; remove the +1 if you want to exclude it
                        .Select(x => numbers[x]) // get the elements between the given indices
                        .Min();                  // take the minimum among them
}
This seems a fun little problem. My first point would be that scanning a fixed array tends to be pretty fast (millions per second), so you'd need a vast amount of data to warrant a more complex solution.
The obvious first improvement is to break out of the loop when you find a 1, since that is the lowest possible value.
If you want something more advanced:
Create a new array of int, and a preload function that populates each item of this array with the next index at which the value gets lower.
Create a loop that uses the new array to skip ahead.
Here is what I mean. Take the following arrays.
int[] intialArray = new int[] { 3, 3, 3, 3, 2, 2, 2, 1 };
int[] searchArray = new int[] { 4, 4, 4, 4, 7, 7, 7, 7 };
So the idea is to find the lowest value between positions 0-7.
Start at initialArray[0] and get the value 3.
Read searchArray[0] and get the value 4. That 4 is the next index where the number is lower.
Read initialArray[4] and get the value 2.
etc.
So basically you'd need to put some effort into building the search array, but once it's complete you can scan each range much faster.
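A minimal sketch of this skip idea, using a stack-based "next smaller element" pass to build the table. Note two assumptions of mine: the names are placeholders, and I use a.Length as the "no lower value" sentinel (the answer's example uses 7 for the last slot):

```csharp
using System;
using System.Collections.Generic;

class SkipSearch
{
    // search[i] = smallest index j > i with a[j] < a[i], or a.Length if none.
    public static int[] BuildSkips(int[] a)
    {
        var search = new int[a.Length];
        var stack = new Stack<int>(); // indices whose next-lower index is still unknown
        for (int i = a.Length - 1; i >= 0; i--)
        {
            while (stack.Count > 0 && a[stack.Peek()] >= a[i]) stack.Pop();
            search[i] = stack.Count > 0 ? stack.Peek() : a.Length;
            stack.Push(i);
        }
        return search;
    }

    // Min of a[lo..hi]: each jump lands on a strictly lower value,
    // so the last in-range index visited holds the minimum.
    public static int RangeMin(int[] a, int[] search, int lo, int hi)
    {
        int i = lo;
        while (search[i] <= hi) i = search[i];
        return a[i];
    }

    static void Main()
    {
        int[] a = { 3, 3, 3, 3, 2, 2, 2, 1 };
        var search = BuildSkips(a);
        Console.WriteLine(string.Join(",", search));  // 4,4,4,4,7,7,7,8
        Console.WriteLine(RangeMin(a, search, 0, 7)); // 1
        Console.WriteLine(RangeMin(a, search, 0, 6)); // 2
    }
}
```

Building the table is O(n) (each index is pushed and popped at most once), and each query follows at most as many jumps as there are distinct decreasing values in the range.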
Form your loop like the following:
int[] inputArray = { 2, 1, 3, 1, 2, 3, 1, 3, 3, 2 };
int minIndex = 2;
int maxIndex = 5;
int minVal = 3; // the values are only ever 1, 2 or 3, so start at the maximum
for (int i = minIndex; i <= maxIndex; i++)
{
    if (inputArray[i] <= minVal)
        minVal = inputArray[i];
}
Console.WriteLine("Minimum value in the given range is = {0}", minVal);

Compare arrays of int in high performance

I can't remember, from my days in college, the way to compare two unsorted arrays of int and find the number of matches.
Each value is unique within its own array, and both arrays are the same size.
for example
int[] a1 = new[] { 1, 2, 4, 5, 0 };
int[] a2 = new[] { 2, 4, 11, -6, 7 };
int numOfMatches = FindMatchesInPerformanceOfNLogN(a1, a2);
Does anyone remember?
If you can store the contents of one of the arrays in a hash table (a HashSet<int> in C#), then you can check for the existence of the other array's elements by looking them up in it. This is O(n) on average.
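In C#, that hash-based approach can be sketched with HashSet<int>; the class and method names are placeholders:

```csharp
using System;
using System.Collections.Generic;

class MatchCounter
{
    // O(n) on average: put one array in a hash set, probe with the other.
    // Safe here because each value is unique within its own array.
    public static int CountMatches(int[] a1, int[] a2)
    {
        var set = new HashSet<int>(a1);
        int matches = 0;
        foreach (var v in a2)
            if (set.Contains(v)) matches++;
        return matches;
    }

    static void Main()
    {
        var a1 = new[] { 1, 2, 4, 5, 0 };
        var a2 = new[] { 2, 4, 11, -6, 7 };
        Console.WriteLine(CountMatches(a1, a2)); // 2
    }
}
```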
One array must be sorted so that you can compare in n*log(n): for every item in the unsorted array (n of them), you perform a binary search on the sorted array (log(n)). If both are unsorted, I don't see a way to compare in n*log(n).
How about this:
concatenate the two arrays,
quicksort the result,
step through from array[1] to array[array.length - 1] and check array[i] against array[i - 1].
If they are the same, you have a duplicate. This is also O(n*log(n)) and does not require a binary search for each check.
You could use LINQ:
var a1 = new int[5] {1, 2, 4, 5, 0};
var a2 = new int[5] {2, 4, 11, -6, 7};
var matches = a1.Intersect(a2).Count();
I'm not sure if you're just asking for a straight-forward way or the fastest/best way possible...
You have two methods that I am aware of (ref: http://www2.cs.siu.edu/~mengxia/Courses%20PPT/220/carrano_ppt08.ppt):
Recursive (pseudocode)
Algorithm to search a[first] through a[last] for desiredItem
    if (there are no elements to search)
        return false
    else if (desiredItem equals a[first])
        return true
    else
        return the result of searching a[first+1] through a[last]
Efficiency
This is O(n): in the worst case, every element is visited once.
Sequential search (pseudocode)
public boolean contains(Object anEntry)
{
    boolean found = false;
    for (int index = 0; !found && (index < length); index++) {
        if (anEntry.equals(entry[index]))
            found = true;
    }
    return found;
}
Efficiency of a sequential search
Best case: O(1) - the desired item is located first.
Worst case: O(n) - must look at all the items.
Average case: O(n) - must look at half the items, and O(n/2) is still O(n).
I am not aware of an O(log n) search algorithm unless it is sorted.
I don't know if it is the fastest way, but you can do:
int[] a1 = new []{1,2,4,5,0};
int[] a2 = new []{2,4,11,-6,7};
var result = a1.Intersect(a2).Count();
It is worth comparing this with other ways that are optimised for int as Intersect() operates on IEnumerable.
This problem is also amenable to parallelization: spawn n1 threads and have each one compare an element of a1 with n2 elements of a2, then sum values. Probably slower, but interesting to consider, is spawning n1 * n2 threads to do all comparisons simultaneously, then reducing. If P >> max(n1, n2) in the first case, P >> n1 * n2 in the second, you could do the whole thing in O(n) in the first case, O(log n) in the second.

Find the number of divisors of a number given an array of prime factors using LINQ

Given an array of prime factors of a natural number, how can I find the total number of divisors using LINQ upon the original array? I've already figured out most of this problem, but I'm having trouble with my LINQ statement.
Math Background:
The prime factors of a number are the prime integers that divide evenly into the number without a remainder. e.g. The prime factors of 60 are 2,2,3,5.
The divisors of a number are all integers (prime or otherwise) that divide evenly into the number without a remainder. The divisors of 60 are 1,2,3,4,5,6,10,12,15,20,30,60.
I am interested in finding the total number of divisors. The total number of divisors for 60 is 12.
Let's express the prime factorization using exponents:
60 = 2^2 * 3^1 * 5^1
To find the total number of divisors given the prime factorization of the number, all we have to do is add 1 to each exponent and then multiply those numbers together, like so:
(2 + 1) * (1 + 1) * (1 + 1) = 12;
That's how you find the number of divisors given the prime factorization of a number.
The Code I Have So Far:
I already have good code to get the prime factors of a number, so I'm not concerned about that. Using LINQ, I want to figure out what the total number of divisors is. I could use a few loops, but I'm trying to use LINQ (if possible).
I'm going to:
Use Distinct() to find the unique values in the array.
Use Count() to find how many times each unique value occurs (this is equal to the exponent).
Use an Aggregate() function to multiply the values together.
Here's the code I have:
class Program
{
    static void Main(string[] args)
    {
        var primeFactors = new int[] { 2, 2, 3, 5 };
        Console.WriteLine(primeFactors.Distinct().PrintList("", ", "));
        // Prints: 2, 3, 5
        Console.WriteLine("[2]:{0} [3]:{1} [5]:{2}"
            , primeFactors.Count(x => x == 2)
            , primeFactors.Count(x => x == 3)
            , primeFactors.Count(x => x == 5)
        );
        // Prints: [2]:2 [3]:1 [5]:1
        // THIS IS WHERE I HAVE TROUBLE:
        Console.WriteLine(primeFactors.Distinct().Aggregate((total, next) =>
            (primeFactors.Count(x => x == next) + 1) * total));
        // Prints: 8
        Console.ReadLine();
    }
}
Specifically, I'm having trouble with this part of code:
primeFactors.Distinct().Aggregate((total, next) =>
    (primeFactors.Count(x => x == next) + 1) * total)
Since the numbers in my array are not stored in the form of x^n, but rather in the form of n instances of x in the array, my thinking is to use Count() to find what n ought to be on a distinct array of x. The Aggregate function is intended to iterate through each distinct item in the array, find its Count + 1, and then multiply that by the total. The lambda expression in Count is supposed to use each distinct number as a parameter (next).
The above code should return 12, but instead it returns 8. I have trouble "stepping through" LINQ in debug mode and I can't figure out how I might better write this.
Why doesn't that portion of my code return the correct number of divisors as I expect? Is there a different (better) way to express this using LINQ?
Try this:
int[] factors = new int[] { 2, 2, 3, 5 };
var q = from o in factors
        group o by o into g
        select g.Count() + 1;
var r = q.Aggregate((x, y) => x * y);
The specific problem with your suggested query is that the Aggregate call fails to count the very first element (not to mention it doesn't increment that count by 1): it erroneously takes the first factor and multiplies its value, instead of its count + 1, with the next one.
If I understand what you're looking to do, you may want to use GroupBy() instead.
var primeFactors = new int[]{ 2, 2, 3, 5 };
var numFacs = primeFactors.GroupBy(f => f, f => f, (g, s) => s.Count() + 1)
                          .Aggregate(1, (x, y) => x * y);
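Both answers boil down to the same computation: multiply (exponent + 1) over each distinct prime. As a self-contained sketch (the class and method names are placeholders):

```csharp
using System;
using System.Linq;

class DivisorCount
{
    // Count the divisors of a number given its prime factor list:
    // group the factors, then multiply (count + 1) over the groups.
    public static int FromPrimeFactors(int[] primeFactors) =>
        primeFactors.GroupBy(f => f)
                    .Select(g => g.Count() + 1)
                    .Aggregate(1, (x, y) => x * y);

    static void Main()
    {
        Console.WriteLine(FromPrimeFactors(new[] { 2, 2, 3, 5 })); // 60 -> 12
        Console.WriteLine(FromPrimeFactors(new[] { 2, 3 }));       // 6  -> 4
    }
}
```

Seeding Aggregate with 1 also handles the edge case of an empty factor list (the number 1 has exactly one divisor).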
