Loop through every possible combination of values in a BitArray - c#

I'm trying to solve a larger problem. As part of this, I have created a BitArray to represent a series of binary decisions taken sequentially. I know that all valid decision series will have half of all decisions true, and half of all false, but I don't know the order:
ttttffff
Or:
tftftftf
Or:
ttffttff
Or any other combination where half of all bits are true, and half false.
My BitArray is quite a bit longer than this, and I need to move through each set of possible decisions (each possible combination of half true, half false), making further checks on their validity. I'm struggling to conceptually work out how to do this with a loop, however. It seems like it should be simple, but my brain is failing me.
EDIT: Because the BitArray wasn't massive, I used usr's suggestion and implemented a bit-shift loop. Based on some of the comments and answers, I re-googled the problem with the keyword "permutations" and found this Stack Overflow question, which is very similar.

I'd do this using a recursive algorithm. Each level sets the next bit. You keep track of how many zeroes and ones have been decided already. If one of those counters goes above N / 2 you abort the branch and backtrack. This should give quite good performance because it will tend to cut off infeasible branches quickly. For example, after setting tttt only f choices are viable.
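A minimal sketch of that recursion (the method name and callback are my own, not from the question): each call fixes one bit and prunes as soon as either counter would exceed N / 2.
using System;

// Enumerates every bool[] of length N with exactly N/2 true values,
// pruning a branch as soon as either count exceeds N/2. Assumes N is even.
// The callback receives the shared buffer, so copy it if you need to keep it.
static void VisitBalanced(bool[] bits, int index, int trues, int falses, Action<bool[]> onComplete)
{
    int half = bits.Length / 2;
    if (trues > half || falses > half)
        return;                        // infeasible branch, backtrack
    if (index == bits.Length)
    {
        onComplete(bits);              // one valid half-true/half-false series
        return;
    }
    bits[index] = true;
    VisitBalanced(bits, index + 1, trues + 1, falses, onComplete);
    bits[index] = false;
    VisitBalanced(bits, index + 1, trues, falses + 1, onComplete);
}

// e.g. VisitBalanced(new bool[8], 0, 0, 0, b => { /* further validity checks here */ });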
A simpler, worse-performing version would be to just loop through all possible N-bit integers with a for loop and discard the ones that do not satisfy the condition. This is easy to implement for up to 63 bits: loop a long from 0 to (1L << N) - 1 and test each value. Clearly, with high bit counts this is far too slow.
You are looking for all permutations of N / 2 zeroes and N / 2 ones. There are algorithms for generating those. If you can find one implemented this should give the best possible performance. I believe those algorithms use clever math tricks to only visit viable combinations.

If you're OK with using the bits in an integer instead of a BitArray, this is a general solution to generate all patterns with some constant number of bits set.
Start with the lowest valid value, which has all of the ones on the right side of the number; you can calculate it as low = ~(-1 << k) (this doesn't work for k = 32, but that's not an issue in this case).
Then apply Gosper's Hack (also shown in this answer), which is a way to generate the next higher integer with the same number of bits set, and keep applying it until you reach the highest valid value, which here is low << k (since the other half of the bits is also k wide).
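A rough sketch of that loop (my own naming; assumes 0 < k < 64 and n <= 64, with n = 2k for the half-and-half case):
using System.Collections.Generic;

// Yields every n-bit value with exactly k bits set, from lowest to highest.
static IEnumerable<ulong> PatternsWithKBitsSet(int n, int k)
{
    ulong low = ~(ulong.MaxValue << k);   // k ones at the right side (the ~(-1 << k) above, done in ulong)
    ulong high = low << (n - k);          // k ones at the left side of the n-bit value
    ulong v = low;
    while (true)
    {
        yield return v;
        if (v == high) yield break;
        // Gosper's Hack: next larger integer with the same number of set bits
        ulong c = v & (~v + 1);           // lowest set bit (v & -v)
        ulong r = v + c;
        v = (((v ^ r) >> 2) / c) | r;
    }
}

Each yielded value can then be tested bit by bit, or copied into a BitArray if the rest of the code expects one.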

This will result in duplicates, but you could check for duplicates before adding to the List if you want to.
using System;
using System.Collections;
using System.Collections.Generic;

static void Main(string[] args)
{
    // Set your bits here:
    bool[] bits = { true, true, false };
    BitArray original_bits = new BitArray(bits);

    permuteBits(original_bits, 0, original_bits.Length - 1);
    foreach (BitArray ba in permutations)
    {
        // You can check validity here
        foreach (bool i in ba)
        {
            Console.Write(Convert.ToInt32(i));
        }
        Console.WriteLine();
    }
}

static List<BitArray> permutations = new List<BitArray>();

// Classic recursive permutation by swapping: fix one position at a time and
// recurse on the rest. With duplicate values in the input this revisits
// identical arrangements, hence the duplicates mentioned above.
static void permuteBits(BitArray bits, int minIndex, int maxIndex)
{
    if (minIndex == maxIndex)
    {
        permutations.Add(new BitArray(bits));
    }
    else
    {
        for (int current_index = minIndex; current_index <= maxIndex; current_index++)
        {
            swap(bits, minIndex, current_index);
            permuteBits(bits, minIndex + 1, maxIndex);
            swap(bits, minIndex, current_index); // restore before trying the next choice
        }
    }
}

private static void swap(BitArray bits, int i, int j)
{
    bool temp = bits[i];
    bits[i] = bits[j];
    bits[j] = temp;
}

If you want to understand the concept of finding all the permutations of a string that has duplicate entries (i.e. zeros and ones), you can read this article.
It uses a recursive solution, and the explanation is good.

Related

Best way to GetHashCode() for 44-bit number stored as Int64

I have around 5,000,000 objects stored in a Dictionary<MyKey, MyValue>.
MyKey is a struct that packs each component of my key (5 different numbers) in the right-most 44 bits of an Int64 (ulong).
Since the ulong will always start with 20 zero-bits, my gut feeling is that returning the native Int64.GetHashCode() implementation is likely to collide more often than if the hash code implementation only considers the 44 bits that are actually in use (although mathematically, I wouldn't know where to begin to prove that theory).
This increases the number of calls to .Equals() and makes dictionary lookups slower.
The .NET implementation of Int64.GetHashCode() looks like this:
public override int GetHashCode()
{
return (int)this ^ (int)(this >> 32);
}
How would I best implement GetHashCode()?
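For context, the alternative that the answers below experiment with (the "super hash") folds the 12 bits above bit 31 back onto the low 32 bits instead of XORing with the mostly-empty top half. A rough sketch, where the struct shape and names are hypothetical (the real key packs five numbers):
struct MyKey44
{
    private readonly ulong packed;   // only the low 44 bits are ever non-zero

    public MyKey44(ulong packed) { this.packed = packed; }

    // Candidate hash: XOR the low 32 bits with the value shifted down by 12,
    // so all 44 used bits influence the result (Equals omitted for brevity).
    public override int GetHashCode()
    {
        return (int)packed ^ (int)(packed >> 12);
    }
}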
I couldn't begin to suggest a "best" way to hash 44-bit numbers. But I can suggest a way to compare it to the 64-bit hash algorithm.
One way to do this is to simply check how many collisions you get for a set of numbers (as suggested by McKenzie et al. in Selecting a Hashing Algorithm). Unless you're going to test all possible values of your set, you'll need to judge whether the number of collisions you get is acceptable. This could be done in code with something like:
using System;
using System.Collections.Generic;
using System.Diagnostics;

var rand = new Random(42);
var dict64 = new Dictionary<int, int>();
var dict44 = new Dictionary<int, int>();
for (int i = 0; i < 100000; ++i)
{
    // get value between 0 and 0xfffffffffff (max 44-bit value)
    var value44 = (ulong)(rand.NextDouble() * 0x0FFFFFFFFFFF);
    var value64 = (ulong)(rand.NextDouble() * ulong.MaxValue);
    var hash64 = value64.GetHashCode();
    var hash44 = (int)value44 ^ (int)(value44 >> 32);
    if (!dict64.ContainsValue(hash64))
    {
        dict64.Add(hash64, hash64);
    }
    if (!dict44.ContainsValue(hash44))
    {
        dict44.Add(hash44, hash44);
    }
}
Trace.WriteLine(string.Format("64-bit hash: {0}, 64-bit hash with 44-bit numbers {1}", dict64.Count, dict44.Count));
In other words, consistently generate 100,000 random 64-bit values and 100,000 random 44-bit values, perform a hash on each and keep track of unique values.
In my test this generated 99998 unique values for 44-bit numbers and 99997 unique values for 64-bit numbers. So that's one fewer collision for 44-bit numbers than for 64-bit numbers. I would expect fewer collisions with 44-bit numbers simply because there are fewer possible inputs.
I'm not going to tell you the 64-bit hash method is "best" for 44-bit; you'll have to decide if these results mean it's good for your circumstances.
Ideally you should be testing with realistic values that your application is likely to generate. Given those will all be 44-bit values, it's hard to compare that to the collisions ulong.GetHashCode() produces (i.e. you'd have identical results). If random values based on a constant seed isn't good enough, modify the code with something better.
While things might not "feel" right, science suggests there's no point in changing something without reproducible tests that prove a change is necessary.
Here's my attempt to answer this question, which I'm posting despite the fact that the answer is the opposite of what I was expecting. (Although I may have made a mistake somewhere - I almost hope so, and am open to criticism regarding my test technique.)
using System;

// Number of Dictionary hash buckets found here:
// http://stackoverflow.com/questions/24366444/how-many-hash-buckets-does-a-net-dictionary-use
const int CNumberHashBuckets = 4999559;

static void Main(string[] args)
{
    Random randomNumberGenerator = new Random();
    int[] dictionaryBuckets1 = new int[CNumberHashBuckets];
    int[] dictionaryBuckets2 = new int[CNumberHashBuckets];
    for (int i = 0; i < 5000000; i++)
    {
        ulong randomKey = (ulong)(randomNumberGenerator.NextDouble() * 0x0FFFFFFFFFFF);
        int simpleHash = randomKey.GetHashCode();
        BumpHashBucket(dictionaryBuckets1, simpleHash);
        int superHash = ((int)(randomKey >> 12)).GetHashCode() ^ ((int)randomKey).GetHashCode();
        BumpHashBucket(dictionaryBuckets2, superHash);
    }
    int collisions1 = ComputeCollisions(dictionaryBuckets1);
    int collisions2 = ComputeCollisions(dictionaryBuckets2);
    Console.WriteLine("simple hash: {0} collisions, super hash: {1} collisions", collisions1, collisions2);
}

private static void BumpHashBucket(int[] dictionaryBuckets, int hashedKey)
{
    int bucketIndex = (int)((uint)hashedKey % CNumberHashBuckets);
    dictionaryBuckets[bucketIndex]++;
}

private static int ComputeCollisions(int[] dictionaryBuckets)
{
    // Each bucket holding n entries contributes n - 1 collisions.
    int i = 0;
    foreach (int dictionaryBucket in dictionaryBuckets)
        i += Math.Max(dictionaryBucket - 1, 0);
    return i;
}
I try to simulate how the processing done by Dictionary will work. The OP says he has "around 5,000,000" objects in a Dictionary, and according to the referenced source there will be either 4999559 or 5999471 "buckets" in the Dictionary.
Then I generate 5,000,000 random 44-bit keys to simulate the OP's Dictionary entries, and for each key I hash it two different ways: the simple ulong.GetHashCode() and an alternative way that I suggested in a comment. Then I turn each hash code into a bucket index using modulo - I assume that's how it is done by Dictionary. This is used to increment the pseudo buckets as a way of computing the number of collisions.
Unfortunately (for me) the results are not as I was hoping. With 4999559 buckets the simulation typically indicates around 1.8 million collisions, with my "super hash" technique actually having a few (around 0.01%) MORE collisions. With 5999471 buckets there are typically around 1.6 million collisions, and my so-called super hash gives maybe 0.1% fewer collisions.
So my "gut feeling" was wrong, and there seems to be no justification for trying to find a better hash code technique.

How to best implement K-nearest neighbours in C# for large number of dimensions?

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation :
// testSamples and trainSamples consist of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = " + K);

    var testResults = new int[testSamples.Count()];
    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();

    // Declaring these here so that I don't have to 'new' them over and over again in the main loop,
    // just to save some overhead
    var distances = new double[trainNumber][];
    for (var i = 0; i < trainNumber; i++)
    {
        distances[i] = new double[2]; // Will store both distance and index in here
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify test sample
        var yea = 0.0;
        var nay = 0.0;
        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)
                yea++;
            else
                nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;
    }
    return testResults;
}

// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid i.e. same length
    for (var i = 0; i < sample1.Count; i++)
    {
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it would also scale to larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time, method by method or even line by line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing the inner loop. Could you parallelize the outer loop instead? There is a small but nonzero cost associated with a delegate call (see here or here) which may be hitting you in the Parallel.For callback.
Similarly, there is a small performance penalty for indexing through an array via its IList interface. You might consider declaring the parameters of GetDistance() as double[] explicitly.
How large is K as compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by using a SortedSet and replacing the smallest element when the set size exceeds K.
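For the last point, a rough sketch (names are mine, not the OP's) of selecting the K smallest distances with a bounded SortedSet instead of sorting all of them:
using System.Collections.Generic;
using System.Linq;

// Returns the indices of the k smallest values in dist, without a full sort.
static List<int> IndicesOfKSmallest(double[] dist, int k)
{
    var best = new SortedSet<(double Dist, int Index)>();
    for (int i = 0; i < dist.Length; i++)
    {
        best.Add((dist[i], i));       // ties on distance are kept apart by the index
        if (best.Count > k)
            best.Remove(best.Max);    // evict the worst of the kept k
    }
    return best.Select(t => t.Index).ToList();
}

This keeps the per-test-sample selection at roughly O(n log K) instead of O(n log n), which matters when K is much smaller than the 20,000 training samples.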

How can I get a random x number of decimals from a list of unique decimals that total up to y?

Say I have a sorted list of 1000 or so unique decimals, arranged by value.
List<decimal> decList
How can I get a random x number of decimals from a list of unique decimals that total up to y?
private List<decimal> getWinningValues(int xNumberToGet, decimal yTotalValue)
{
}
Is there any way to avoid a long processing time on this? My idea so far is to take xNumberToGet random numbers from the pool. Something like (cool way to get random selection from a list)
foreach (decimal d in decList.OrderBy(x => randomInstance.Next()).Take(xNumberToGet))
{
}
Then I might check the total of those, and if the total is less, I might shift the numbers up (to the next available number) slowly. If the total is more, I might shift the numbers down. I'm still not sure how to implement this, or if there is a better design readily available. Any help would be much appreciated.
Ok, start with a little extension I got from this answer,
public static IEnumerable<IEnumerable<T>> Combinations<T>(
    this IEnumerable<T> source,
    int k)
{
    if (k == 0)
    {
        return new[] { Enumerable.Empty<T>() };
    }
    return source.SelectMany((e, i) =>
        source.Skip(i + 1).Combinations(k - 1)
            .Select(c => (new[] { e }).Concat(c)));
}
this gives you a pretty efficient method to yield all the combinations with k members, without repetition, from a given IEnumerable. You could make good use of this in your implementation.
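For example, assuming the method above is placed in a static class so it can be called as an extension:
// All 2-element combinations of { 1, 2, 3 }: prints 1,2 then 1,3 then 2,3
foreach (var combo in new[] { 1, 2, 3 }.Combinations(2))
{
    Console.WriteLine(string.Join(",", combo));
}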
Bear in mind, if the IEnumerable and k are sufficiently large this could take some time, i.e. much longer than you have. So, I've modified your function to take a CancellationToken.
private static IEnumerable<decimal> GetWinningValues(
    IEnumerable<decimal> allValues,
    int numberToGet,
    decimal targetValue,
    CancellationToken canceller)
{
    IList<decimal> currentBest = null;
    var currentBestGap = decimal.MaxValue;
    var locker = new object();

    allValues.Combinations(numberToGet)
        .AsParallel()
        .WithCancellation(canceller)
        .TakeWhile(c => currentBestGap != decimal.Zero)
        .ForAll(c =>
        {
            var gap = Math.Abs(c.Sum() - targetValue);
            if (gap < currentBestGap)
            {
                lock (locker)
                {
                    currentBestGap = gap;
                    currentBest = c.ToList();
                }
            }
        });

    return currentBest;
}
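A hypothetical call site using the question's names; note that if the token fires, PLINQ's WithCancellation surfaces the cancellation as an OperationCanceledException rather than returning the best candidate found so far:
using System;
using System.Threading;

var canceller = new CancellationTokenSource(TimeSpan.FromSeconds(5));
try
{
    var winners = GetWinningValues(decList, xNumberToGet, yTotalValue, canceller.Token);
    Console.WriteLine(string.Join(", ", winners));
}
catch (OperationCanceledException)
{
    // the time limit was reached before an exact match was found
}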
I have an idea that you could sort the initial list and quit iterating the combinations at a certain point, when the sum must exceed the target. After some consideration, it's not trivial to identify that point, and the cost of checking may exceed the benefit. This benefit would have to be balanced against some function of the target value and the mean of the set.
I still think further optimization is possible but I also think that this work has already been done and I'd just need to look it up in the right place.
There are some number k of subsets of decList that satisfy your criteria (k might be 0).
Assuming that you want to select each one with uniform probability 1/k, I think you basically need to do the following:
1. iterate over all the matching subsets
2. select one
Step 1 is potentially a big task, you can look into the various ways of solving the "subset sum problem" for a fixed subset size, and adapt them to generate each solution in turn.
Step 2 can be done either by making a list of all the solutions and choosing one, or (if that might take too much memory) by using a streaming random selection algorithm (reservoir sampling).
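A minimal sketch of that streaming selection (the method name is mine):
using System;
using System.Collections.Generic;

// Picks one element uniformly at random from a sequence of unknown length, in a single pass.
static T PickUniformly<T>(IEnumerable<T> solutions, Random rng)
{
    T chosen = default(T);
    int seen = 0;
    foreach (var s in solutions)
    {
        seen++;
        if (rng.Next(seen) == 0)   // keep the i-th item with probability 1/i
            chosen = s;
    }
    if (seen == 0)
        throw new InvalidOperationException("No matching subsets.");
    return chosen;
}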
If your data is likely to have lots of such subsets, then generating them all might be incredibly slow. In that case you might try to identify groups of them at a time. You'd have to know the size of the group without visiting its members one by one, then you can choose which group to use weighted by its size, then you've reduced the problem to selecting one of that group at random.
If you don't need to select with uniform probability then the problem might become easier. At the best case, if you don't care about the distribution at all then you can return the first subset-sum solution you find -- whether you'd call that "at random" is another matter...

Determining how close an array is to the target array

I'm running a little experiment to increase my knowledge, and I have reached a part where I feel I could really optimize it, but am not quite sure how to do this.
I have many arrays of numbers. (For simplicity, let's say each array has 4 numbers: 1, 2, 3, and 4.)
The target is to have all of the numbers in ascending order (ie,
1-2-3-4), but the numbers are all scrambled in the different arrays.
A higher weight is placed upon larger numbers.
I need to sort all of these arrays in order of how close they are to
the target.
Ie, 4-3-2-1 would be the worst possible case.
Some example cases:
3-4-2-1 is better than 4-3-2-1
2-3-4-1 is better than 1-4-3-2 (even though two numbers (1 and 3) are in the right spot in 1-4-3-2,
the biggest number is closer to its spot in 2-3-4-1).
So the big numbers always take precedence over the smaller numbers. Here is my attempt:
var tmp = from m in moves
let mx = m.Max()
let ranking = m.IndexOf(s => s == mx)
orderby ranking descending
select m;
return tmp.ToArray();
P.S. IndexOf, in the above example, is an extension I wrote to take an array and an expression and return the index of the element that satisfies the expression. It is needed because the real situation is a little more complicated; I'm simplifying it with my example.
The problem with my attempt here, though, is that it would only sort by the biggest number and forget all of the other numbers. It SHOULD rank by biggest number first, then by second largest, then by third.
Also, since it will be doing this operation over and over again for several minutes, it should be as efficient as possible.
You could implement a bubble sort, and count the number of times you have to move data around. The number of data moves will be large on arrays that are far away from the sorted ideal.
int GetUnorderedness<T>(T[] data) where T : IComparable<T>
{
    data = (T[])data.Clone(); // don't modify the input data,
                              // we weren't asked to actually sort.
    int swapCount = 0;
    bool isSorted;
    do
    {
        isSorted = true;
        for (int i = 1; i < data.Length; i++)
        {
            if (data[i - 1].CompareTo(data[i]) > 0)
            {
                T temp = data[i];
                data[i] = data[i - 1];
                data[i - 1] = temp;
                swapCount++;
                isSorted = false;
            }
        }
    } while (!isSorted);
    return swapCount;
}
From your sample data, this will give slightly different results than you specified.
Some example cases:
3-4-2-1 is better than 4-3-2-1
2-3-4-1 is better than 1-4-3-2
3-4-2-1 will take 5 swaps to sort, 4-3-2-1 will take 6, so that works.
2-3-4-1 will take 3, 1-4-3-2 will also take 3, so this doesn't match up with your expected results.
This algorithm doesn't treat the largest number as the most important, which it seems you want; all numbers are treated equally. From your description, you'd consider 2-1-3-4 as much better than 1-2-4-3, because the first one has both the largest and second largest numbers in their proper place. This algorithm would consider those two equal, because each requires only 1 swap to sort the array.
This algorithm does have the advantage that it's not just a comparison algorithm, each input has a discrete output, so you only need to run the algorithm once for each input array.
I hope this helps
var i = 0;
var temp = (from m in moves select m).ToArray();
do
{
    temp = (from m in temp
            orderby m[i] descending
            select m).ToArray();
}
while (++i < moves[0].Length);

Find number of differences in 2 strings

int n = string.numDifferences("noob", "newb"); // 2
??
The number you are trying to find is called the edit distance. Wikipedia lists several algorithms you might want to use; the Hamming distance is a very common way of finding the edit difference between two strings of the same length (it's often used in error-correcting codes); the Levenshtein distance is similar, but also takes insertions and deletions into account. Wikipedia, of course, lists several others (e.g. Damerau-Levenshtein distance, which includes transpositions); I don't know which you want, as I'm no expert and the choice is domain-specific. One of these, though, should do the trick.
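For the Levenshtein case mentioned above (where insertions and deletions count too), here is a standard dynamic-programming sketch in C#; for "noob" vs "newb" it also returns 2:
using System;

// Classic Levenshtein edit distance: minimum number of insertions,
// deletions, and substitutions needed to turn a into b.
static int Levenshtein(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;   // delete everything
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;   // insert everything
    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(
                d[i - 1, j] + 1,          // deletion
                d[i, j - 1] + 1),         // insertion
                d[i - 1, j - 1] + cost);  // substitution
        }
    }
    return d[a.Length, b.Length];
}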
Assuming that you only want to compare characters at the same indices, the following C# solution (using methods provided by LINQ) should do the trick:
var count = s1.Zip(s2, (c1, c2) => c1 == c2 ? 0 : 1).Sum();
This "zips" the two strings, and then returns 0 for each index where the characters are the same and 1 for each index where they differ. Then we simply sum the numbers and we get the result.
You already got excellent answers if you mean "edit distance". If you just mean "number of characters that differ" (for two strings of the same length), in Python, the simplest approach would be:
sum(c1!=c2 for c1, c2 in zip(s1, s2))
and if you also want to add the length difference, append
+ abs(len(s1) - len(s2))
Of course, if you do want edit distances, this approach would be far too simplistic;-).
import java.util.*;

class AnagramStringDifference
{
    public int AnagramStringDifference(String A, String B)
    {
        int diff = 0, Ai = 0, Bi = 0;
        char[] Aa = A.toCharArray();
        char[] Bb = B.toCharArray();
        Arrays.sort(Aa);
        Arrays.sort(Bb);
        while (Ai < Aa.length && Bi < Bb.length)
        {
            int c = Character.compare(Aa[Ai], Bb[Bi]);
            if (c < 0)
            {
                diff++;
                Ai++;
            }
            else if (c > 0)
            {
                diff++;
                Bi++;
            }
            else // c == 0
            {
                Ai++;
                Bi++;
            }
        }
        // whatever remains unmatched in the longer (sorted) array is also a difference
        diff += Math.abs((Aa.length - Ai) - (Bb.length - Bi));
        return diff;
    }
}
P.S. I was asked a very similar, difficult question in a Codility online test for an online job application, with only around 2 hours for 4 hard questions. I wonder how overcrowded the IT industry is, or how much unreasonable pressure management places on IT workers, if temp-agency recruiters can get away with asking such a difficult question to screen for an entry-level technical support job.
import math

def differences(s1, s2):
    count = 0
    for i in range(len(s1)):
        count += int(s1[i] != s2[i])
    # count += math.sqrt((len(s1) - len(s2)) ** 2)  # add this line if the two strings can differ
    # in length, so the extra characters in the longer string each count as a difference
    return count
Hope this helps
