Selecting set of binary sequences to avoid similarity - c#

I want to be able to programmatically generate a set of binary sequences of a given length whilst avoiding similarity between any two sequences.
I'll define 'similar' between two sequences thus:
If sequence A can be converted to sequence B (or B to A) by bit-shifting A (non-circularly) and padding with 0s, then A and B are similar (note: bit-shifting is allowed on only one of the sequences, otherwise both could always be shifted to a sequence of just 0s).
For example: A = 01010101 B = 10101010 C = 10010010
In this example, A and B are similar because a single left-shift of A results in B (A << 1 = B). A and C are not similar because no bit-shifting of one can result in the other.
A set of sequences is defined as dissimilar if no two of its members are similar.
I believe there could be multiple sets for a given sequence length and presumably the size of the set will be significantly less than the total possibilities (total possibilities = 2 ^ sequence length).
I need a way to generate a set for a given sequence length. Does an algorithm exist that can achieve this? Selecting sequences one at a time and checking against all previously selected sequences is not acceptable for my use case (but may have to be if a better method doesn't exist!).
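For reference, a brute-force version of that pairwise check might look roughly like this (a sketch; AreSimilar and IsDissimilarSet are illustrative names, and the masking assumes sequence lengths up to 64):
// Sketch of the naive pairwise check described above (illustrative only).
static bool AreSimilar(ulong a, ulong b, int length)
{
    ulong mask = length == 64 ? ulong.MaxValue : (1UL << length) - 1;
    for (int shift = 0; shift < length; shift++)
    {
        // shift one sequence in either direction, padding with 0s
        if (((a << shift) & mask) == b || (a >> shift) == b) return true;
        if (((b << shift) & mask) == a || (b >> shift) == a) return true;
    }
    return false;
}

static bool IsDissimilarSet(ulong[] set, int length)
{
    for (int i = 0; i < set.Length; i++)
        for (int j = i + 1; j < set.Length; j++)
            if (AreSimilar(set[i], set[j], length)) return false;
    return true;
}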
I've tried generating sets of integers based on primes numbers and also the golden ratio, then converting to binary. This seemed like it might be a viable method, but I have been unable to get it to work as expected.
Update: I have written a function in C# that uses a prime number modulo to generate the set, without success. I've also tried using the Fibonacci sequence, which finds a mostly dissimilar set, but of a size that is very small compared to the number of possibilities:
private List<string> GetSequencesFib(int sequenceLength)
{
    // ToBitString is an extension method (not shown) that formats the value
    // as a fixed-width binary string of the given length.
    var sequences = new List<string>();
    long current = 21;
    long prev = 13;
    long prev2 = 8;
    long size = (long)Math.Pow(2, sequenceLength);
    while (current < size)
    {
        current = prev + prev2;
        sequences.Add(current.ToBitString(sequenceLength));
        prev2 = prev;
        prev = current;
    }
    return sequences;
}
This generates a set of sequences of size 41 that is roughly 60% dissimilar (sequenceLength = 32). It is started at 21, since lower values produce sequences of mostly 0s, which are similar to many other sequences.
By relaxing the conditions of similarity to only allowing a small number of successive bit-shifts, the proportion of dissimilar sequences approaches 100%. This may be acceptable in my use case.
Update 2:
I've implemented a function following DCHE's suggestion, by selecting all odd numbers greater than half the maximum value for a given sequence length:
private static List<string> GetSequencesOdd(int length)
{
    var sequences = new List<string>();
    long max = (long)(Math.Pow(2, length));
    long quarterMax = max / 4;
    // Start at max/2 + 1 (the smallest odd number with the top bit set)
    // and take every odd number up to max.
    for (long n = quarterMax * 2 + 1; n < max; n += 2)
    {
        sequences.Add(n.ToBitString(length));
    }
    return sequences;
}
This produces an entirely dissimilar set as per my requirements. I can see why this works mathematically as well: every member has both its top and bottom bit set, and any non-zero left shift clears the bottom bit while any non-zero right shift clears the top bit, so no shift of one member can ever equal another.

I can't prove it, but from my experimenting, I think that your set is the odd integers greater than half of the largest number in binary. E.g. for bit sets of length 3, max integer is 7, so the set is 5 and 7 (101, 111).

Related

Bit reverse numbers by N bits

I am trying to find a simple algorithm that reverses the bits of a number up to N bits. For example:
For N = 2:
01 -> 10
11 -> 11
For N = 3:
001 -> 100
011 -> 110
101 -> 101
The only things I keep finding are how to bit-reverse a full byte, but that only works for N = 8 and that's not always what I need.
Does anyone know an algorithm that can do this bitwise operation? I need to do many of them for an FFT, so I'm looking for something that can be heavily optimised too.
Here is a C# implementation of the bitwise reverse operation:
public uint Reverse(uint a, int length)
{
    uint b = 0b_0;
    for (int i = 0; i < length; i++)
    {
        b = (b << 1) | (a & 0b_1);
        a = a >> 1;
    }
    return b;
}
The code above shifts the output value to the left, copies the lowest bit of the input into the output, then shifts the input to the right, and repeats until all bits have been processed. Here are some samples:
uint a = 0b_1100;
uint b = Reverse(a, 4); //should be 0b_0011;
And
uint a = 0b_100;
uint b = Reverse(a, 3); //should be 0b_001;
This implementation's time complexity is O(N), where N is the length of the input.
Here's a small look-up table solution that's good for (2<=N<=32).
For N==8, I think everyone agrees that a 256 byte array lookup table is the way to go. Similarly, for N from 2 to 7, you could create 4, 8, ... 128 lookup byte arrays.
For N==16, you could flip each byte and then reorder the two bytes. Similarly, for N==24, you could flip each byte and then reorder things (which would leave the middle one flipped but in the same position). It should be obvious how N==32 would work.
For N==9, think of it as three 3-bit numbers (flip each of them, reorder them and then do some masking and shifting to get them in the right position). For N==10, it's two 5-bit numbers. For N==11, it's two 5-bit numbers on either side of a center bit that doesn't change. The same for N==13 (two 6-bit numbers around an unchanging center bit). For a prime like N==23, it would be a pair of 8-bit numbers around a center 7-bit number.
For the odd numbers between 24 and 32 it gets more complicated. You probably need to consider five separate numbers. Consider N==29, that could be four 7-bit numbers around an unchanging center bit. For N==31, it would be a center bit surround by a pair of 8-bit numbers and a pair of 7-bit numbers.
That said, that's a ton of complicated logic. It would be a bear to test. It might be faster than @MuhammadVakili's bit shifting solution (it certainly would be for N<=8), but it might not. I suggest you go with his solution.
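For illustration, here's a small sketch of the table-driven idea for N == 8 and N == 16 (the table and method names are mine, not from the answer):
// Sketch: 256-entry lookup table for N == 8, composed for N == 16.
static readonly byte[] ReversedByte = BuildReverseTable();

static byte[] BuildReverseTable()
{
    var table = new byte[256];
    for (int i = 0; i < 256; i++)
    {
        int v = i, r = 0;
        for (int bit = 0; bit < 8; bit++)
        {
            r = (r << 1) | (v & 1);
            v >>= 1;
        }
        table[i] = (byte)r;
    }
    return table;
}

static uint Reverse8(uint a) => ReversedByte[a & 0xFF];

// N == 16: reverse each byte with the table, then swap the two bytes.
static uint Reverse16(uint a) =>
    (uint)(ReversedByte[a & 0xFF] << 8) | ReversedByte[(a >> 8) & 0xFF];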
Using string manipulation?
static void Main(string[] args)
{
    uint number = 269;
    int numBits = 4;
    string strBinary = Convert.ToString(number, 2).PadLeft(32, '0');
    Console.WriteLine($"{number}");
    Console.WriteLine($"{strBinary}");
    string strBitsReversed = new string(strBinary.Substring(strBinary.Length - numBits, numBits).ToCharArray().Reverse().ToArray());
    string strBinaryModified = strBinary.Substring(0, strBinary.Length - numBits) + strBitsReversed;
    uint numberModified = Convert.ToUInt32(strBinaryModified, 2);
    Console.WriteLine($"{strBinaryModified}");
    Console.WriteLine($"{numberModified}");
    Console.Write("Press Enter to Quit.");
    Console.ReadLine();
}
Output:
269
00000000000000000000000100001101
00000000000000000000000100001011
267
Press Enter to Quit.

Cryptography .NET, Avoiding Timing Attack

I was browsing crackstation.net website and came across this code which was commented as following:
Compares two byte arrays in length-constant time. This comparison method is used so that password hashes cannot be extracted from on-line systems using a timing attack and then attacked off-line.
private static bool SlowEquals(byte[] a, byte[] b)
{
    uint diff = (uint)a.Length ^ (uint)b.Length;
    for (int i = 0; i < a.Length && i < b.Length; i++)
        diff |= (uint)(a[i] ^ b[i]);
    return diff == 0;
}
Can anyone please explain how this function actually works, why we need to convert the length to an unsigned integer, and how this method avoids a timing attack? What does the line diff |= (uint)(a[i] ^ b[i]); do?
This sets diff based on whether there's a difference between a and b.
It avoids a timing attack by always walking through the entirety of the shorter of the two of a and b, regardless of whether there's a mismatch sooner than that or not.
The diff |= (uint)(a[i] ^ b[i]) takes the exclusive-or of a byte of a with the corresponding byte of b. That will be 0 if the two bytes are the same, or non-zero if they're different. It then ORs that with diff.
Therefore, diff will be set to non-zero in an iteration if a difference was found between the inputs in that iteration. Once diff is given a non-zero value at any iteration of the loop, it will retain the non-zero value through further iterations.
Therefore, the final result in diff will be non-zero if any difference is found between corresponding bytes of a and b, and 0 only if all bytes (and the lengths) of a and b are equal.
Unlike a typical comparison, however, this will always execute the loop until all the bytes in the shorter of the two inputs have been compared to bytes in the other. A typical comparison would have an early-out where the loop would be broken as soon as a mismatch was found:
bool Equal(byte[] a, byte[] b)
{
    if (a.Length != b.Length)
        return false;
    for (int i = 0; i < a.Length; i++)
        if (a[i] != b[i])
            return false;
    return true;
}
With this, based on the amount of time consumed to return false, we can learn (at least an approximation of) the number of bytes that matched between a and b. Let's say the initial test of length takes 10 ns, and each iteration of the loop takes another 10 ns. Based on that, if it returns false in 50 ns, we can quickly guess that we have the right length, and the first four bytes of a and b match.
Even without knowing the exact amounts of time, we can still use the timing differences to determine the correct string. We start with a string of length 1, and increase that one byte at a time until we see an increase in the time taken to return false. Then we run through all the possible values in the first byte until we see another increase, indicating that it has executed another iteration of the loop. Continue with the same for successive bytes until all bytes match and we get a return of true.
The original is still open to a little bit of a timing attack -- although we can't easily determine the contents of the correct string based on timing, we can at least find the string length based on timing. Since it only compares up to the shorter of the two strings, we can start with a string of length 1, then 2, then 3, and so on until the time becomes stable. As long as the time is increasing our proposed string is shorter than the correct string. When we give it longer strings, but the time remains constant, we know our string is longer than the correct string. The correct length of string will be the shortest one that takes that maximum duration to test.
Whether this is useful or not depends on the situation, but it's clearly leaking some information, regardless. For truly maximum security, we'd probably want to append random garbage to the end of the real string to make it the length of the user's input, so the time stays proportional to the length of the input, regardless of whether it's shorter, equal to, or longer than the correct string.
This version iterates for the full length of the input a:
private static bool SlowEquals(byte[] a, byte[] b)
{
    uint diff = (uint)a.Length ^ (uint)b.Length;
    byte[] c = new byte[] { 0 };
    for (int i = 0; i < a.Length; i++)
        diff |= (uint)(GetElem(a, i, c, 0) ^ GetElem(b, i, c, 0));
    return diff == 0;
}

private static byte GetElem(byte[] x, int i, byte[] c, int i0)
{
    bool ok = (i < x.Length);
    return (ok ? x : c)[ok ? i : i0];
}

List<T> capacity increasing vs Dictionary<K,V> capacity increasing?

Why does List<T> increase its capacity by a factor of 2?
private void EnsureCapacity(int min)
{
    if (this._items.Length < min)
    {
        int num = (this._items.Length == 0) ? 4 : (this._items.Length * 2);
        if (num < min)
        {
            num = min;
        }
        this.Capacity = num;
    }
}
Why does Dictionary<K,V> use prime numbers as capacity?
private void Resize()
{
    int prime = HashHelpers.GetPrime(this.count * 2);
    int[] numArray = new int[prime];
    for (int i = 0; i < numArray.Length; i++)
    {
        numArray[i] = -1;
    }
    Entry<TKey, TValue>[] destinationArray = new Entry<TKey, TValue>[prime];
    Array.Copy(this.entries, 0, destinationArray, 0, this.count);
    for (int j = 0; j < this.count; j++)
    {
        int index = destinationArray[j].hashCode % prime;
        destinationArray[j].next = numArray[index];
        numArray[index] = j;
    }
    this.buckets = numArray;
    this.entries = destinationArray;
}
Why doesn't it also just multiply by 2? Both are dealing with finding contiguous memory locations... correct?
It's common to use prime numbers for hash table sizes because it reduces the probability of collisions.
Hash tables typically use the modulo operation to find the bucket where an entry belongs, as you can see in your code:
int index = destinationArray[j].hashCode % prime;
Suppose your hashCode function results in the following hashCodes among others {x, 2x, 3x, 4x, 5x, 6x...}, then all these are going to be clustered in just m buckets, where m = table_length/GreatestCommonFactor(table_length, x). (It is trivial to verify/derive this.) Now you can do one of the following to avoid clustering:
Make sure that you don't generate too many hashCodes that are multiples of another hashCode like in {x, 2x, 3x, 4x, 5x, 6x...}. But this may be kind of difficult if your hashTable is supposed to have millions of entries.
Or simply make m equal to the table_length by making GreatestCommonFactor(table_length, x) equal to 1, i.e. by making table_length coprime with x. And if x can be just about any number then make sure that table_length is a prime number.
(from http://srinvis.blogspot.com/2006/07/hash-table-lengths-and-prime-numbers.html)
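The clustering claim in the quoted passage can be illustrated with a small sketch (the table lengths 12 and 13 below are just illustrative values):
using System.Collections.Generic;

// Sketch: count how many distinct buckets the hash codes x, 2x, 3x, ... occupy.
static int DistinctBuckets(int tableLength, int x, int count)
{
    var buckets = new HashSet<int>();
    for (int k = 1; k <= count; k++)
        buckets.Add((k * x) % tableLength);
    return buckets.Count;
}

// DistinctBuckets(12, 4, 1000) == 3   // 12 / gcd(12, 4) = 3 buckets used
// DistinctBuckets(13, 4, 1000) == 13  // prime length: every bucket is reachable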
HashHelpers.GetPrime(this.count * 2)
should return a prime number. Look at the definition of HashHelpers.GetPrime().
Dictionary puts all its objects into buckets depending on their GetHashCode value, i.e.
Bucket[object.GetHashCode() % DictionarySize] = object;
It uses a prime number for size to avoid the chance of collisions. Presumably a size with many divisors would be bad for poorly designed hash codes.
From a question on SO:
Dictionary or hash table relies on hashing the key to get a smaller index to look up into the corresponding store (array). So the choice of hash function is very important. The typical choice is to get the hash code of a key (so that we get a good random distribution) and then divide the code by a prime number and use the remainder to index into a fixed number of buckets. This allows us to convert arbitrarily large hash codes into a bounded set of small numbers for which we can define an array to look up into. So it's important to have the array size be a prime number, and the best choice for the size then becomes the prime number that is larger than the required capacity. And that's exactly what the dictionary implementation does.
List<T> employs arrays to store data, and increasing the capacity of an array requires copying the array to a new memory location, which is time consuming. I guess that, in order to lower the occurrence of copying arrays, List doubles its capacity.
I'm not a computer scientist, but...
Most probably it's related to the hash table's load factor (the last link is just a math definition), and, so as not to create more confusion for the non-mathematical audience, it's important to define that:
loadFactor = UsedCells/AllCells
this we can write as
loadFactor = UsedBuckets/AllBuckets
The load factor defines the probability of collisions in the hash map.
So by using a prime number, a number that
..is a natural number greater than 1 that has no positive divisors other than 1 and itself.
we decrease (but do not eliminate) the risk of collisions in our hash map.
If the load factor tends to 0, we have a safer hash map, so we always have to keep it as low as possible. According to an MS blog post, they found that the optimal value of the load factor is around 0.72, so when it becomes bigger, the capacity is increased, following the nearest prime number.
EDIT
To be clearer on this: having a prime number ensures, as much as possible, a uniform distribution of the hashes in this concrete hash implementation used by the .NET dictionary. It's not about the efficiency of retrieving the values, but the efficiency of the memory used and the reduction of collision risk.
Hope this helps.
Dictionary needs some heuristic so that the hash code distribution among buckets is more uniform.
.NET's Dictionary uses a prime number of buckets to do that, and then calculates the bucket index like this:
int num = this.comparer.GetHashCode(key) & 2147483647; // make hash code positive
// get the remainder from division - that's our bucket index
int num2 = this.buckets[num % ((int)this.buckets.Length)];
When it grows, it doubles the number of buckets and then adds some more to make the number prime again.
It's not the only heuristic possible. Java's HashMap, for example, takes another approach. The number of buckets there is always a power of 2 and on grow it just doubles the number of buckets:
resize(2 * table.length);
But when calculating bucket index it modifies hash:
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

static int indexFor(int h, int length) {
    return h & (length-1);
}

// from put() method
int hash = hash(key.hashCode()); // get modified hash
int i = indexFor(hash, table.length); // trim the hash to the bucket count
List on the other hand doesn't need any heuristic, so they didn't bother.
Addition: Grow behavior doesn't influence Add's complexity at all. Dictionary, HashMap and List each have amortized Add complexity of O(1).
The grow operation takes O(N), but it occurs only every N-th time, so to cause a grow operation we need to call Add N times. For N=8, the time it takes to do N Adds has the value
O(1)+O(1)+O(1)+O(1)+O(1)+O(1)+O(1)+O(N) = O(N)+O(N) = O(2N) = O(N)
So, N Adds take O(N), then one Add takes O(1).
Increasing the capacity by a constant factor (instead of, for example, increasing the capacity by an additive constant) when resizing is required to guarantee some amortized running times. For example, adding to or removing from the end of an array-based list requires O(1) time, except when you have to increase or decrease the capacity, which requires copying the list content and therefore O(n) time. Changing the capacity by a constant factor guarantees that the amortized runtime is still O(1). The optimal value of the factor depends on the expected usage. Some more information on Wikipedia.
Choosing the capacity of a hash table to be prime is used to improve the distribution of the items. bucket[hash % capacity] will yield a more uniform distribution when hash is not uniformly distributed, provided capacity is prime. (I can not give the math behind that but I am looking for a good reference.) The combination of this with the first point is exactly what the implementation does - increasing the capacity by a factor of (at least) 2 and also ensuring that the capacity is prime.
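As a rough, illustrative sketch of that amortized argument (not from either answer), one can count the element copies a List-style doubling strategy performs while adding n items:
// Sketch: count element copies performed by a doubling growth strategy.
// Adding n items costs n writes plus fewer than 2n copies from reallocations,
// i.e. O(1) amortized per Add.
static long CountCopies(int n)
{
    int capacity = 4;   // List<T> starts at 4 once the first item is added
    int count = 0;
    long copies = 0;
    for (int i = 0; i < n; i++)
    {
        if (count == capacity)
        {
            copies += count;    // reallocation copies all existing elements
            capacity *= 2;
        }
        count++;
    }
    return copies;
}

// CountCopies(1_000_000) stays well under 2_000_000, so the copy cost per Add is bounded.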

Fastest way to sum digits in a number

Given a large number, e.g. 9223372036854775807 (Int64.MaxValue), what is the quickest way to sum the digits?
Currently I am ToStringing and reparsing each char into an int:
num.ToString().Sum(c => int.Parse(new String(new char[] { c })));
Which is surely horrifically inefficient. Any suggestions?
And finally, how would you make this work with BigInteger?
Thanks
Well, another option is:
int sum = 0;
while (value != 0)
{
    int remainder;
    value = Math.DivRem(value, 10, out remainder);
    sum += remainder;
}
BigInteger has a DivRem method as well, so you could use the same approach.
Note that I've seen DivRem not be as fast as doing the same arithmetic "manually", so if you're really interested in speed, you might want to consider that.
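For BigInteger, a minimal sketch of the same approach (using System.Numerics) might be:
using System.Numerics;

// Sketch: repeatedly divide by 10, accumulating the remainders (the decimal digits).
static int SumDigits(BigInteger value)
{
    int sum = 0;
    value = BigInteger.Abs(value);
    while (value != 0)
    {
        BigInteger remainder;
        value = BigInteger.DivRem(value, 10, out remainder);
        sum += (int)remainder;
    }
    return sum;
}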
Also consider a lookup table with (say) 1000 elements precomputed with the sums:
int sum = 0;
while (value != 0)
{
    int remainder;
    value = Math.DivRem(value, 1000, out remainder);
    sum += lookupTable[remainder];
}
That would mean fewer iterations, but each iteration has an added array access...
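A possible way to precompute that lookup table (a sketch; the table covers the remainders 0-999 used above):
// Precompute digit sums for 0..999 so each loop iteration handles three digits.
static readonly int[] lookupTable = BuildDigitSumTable();

static int[] BuildDigitSumTable()
{
    var table = new int[1000];
    for (int i = 0; i < 1000; i++)
        table[i] = i / 100 + (i / 10) % 10 + i % 10;
    return table;
}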
Nobody has discussed the BigInteger version. For that I'd look at 10^1, 10^2, 10^4, 10^8 and so on until you find the last 10^(2^n) that is less than your value. Take your number div and mod 10^(2^n) to come up with 2 smaller values. Wash, rinse, and repeat recursively. (You should keep your iterated squares of 10 in an array, and in the recursive part pass along the information about the next power to use.)
With a BigInteger with k digits, dividing by 10 is O(k). Therefore finding the sum of the digits with the naive algorithm is O(k^2).
I don't know what C# uses internally, but the non-naive algorithms out there for multiplying or dividing a k-bit by a k-bit integer all work in time O(k^1.6) or better (most are much, much better, but have an overhead that makes them worse for "small big integers"). In that case preparing your initial list of powers and splitting once takes time O(k^1.6). This gives you 2 problems of size O((k/2)^1.6) = 2^(-0.6) * O(k^1.6). At the next level you have 4 problems of size O((k/4)^1.6) for another 2^(-1.2) * O(k^1.6) work. Add up all of the terms and the powers of 2 turn into a geometric series converging to a constant, so the total work is O(k^1.6).
This is a definite win, and the win will be very, very evident if you're working with numbers in the many thousands of digits.
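A minimal sketch of that divide-and-conquer idea (simplified: it assumes a non-negative value, and the powers of 10 are recomputed per call rather than cached in an array as the answer suggests):
using System.Numerics;

// Sketch: split the number on the largest 10^(2^n) below it, then recurse.
static int SumDigitsRecursive(BigInteger value)
{
    if (value < 1_000_000_000)          // small enough to handle with plain longs
    {
        long v = (long)value, sum = 0;
        while (v != 0) { sum += v % 10; v /= 10; }
        return (int)sum;
    }

    // Find the largest 10^(2^n) not exceeding value (10^1, 10^2, 10^4, 10^8, ...).
    BigInteger power = 10;
    while (power * power <= value)
        power *= power;

    BigInteger low;
    BigInteger high = BigInteger.DivRem(value, power, out low);
    return SumDigitsRecursive(high) + SumDigitsRecursive(low);
}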
Yes, it's probably somewhat inefficient. I'd probably just repeatedly divide by 10, adding together the remainders each time.
The first rule of performance optimization: don't divide when you can multiply instead. The following function will take four-digit numbers 0-9999 and do what you ask. The intermediate calculations are larger than 16 bits. We multiply the number by 10/10000 and take the result as a Q16 fixed point number. Digits are then extracted by multiplying by 10 and taking the integer part.
#define TEN_OVER_10000 ((1<<25)/1000 + 1) // .001 Q25
int sum_digits(unsigned int n)
{
    int c;
    int sum = 0;
    n = (n * TEN_OVER_10000) >> 9; // n*10/10000 Q16
    for (c = 0; c < 4; c++)
    {
        printf("Digit: %d\n", n >> 16);
        sum += n >> 16;
        n = (n & 0xffff) * 10; // next digit
    }
    return sum;
}
This can be extended to larger sizes, but it's tricky. You need to ensure that the rounding in the fixed point calculation always works correctly. I also used 4-digit numbers so the intermediate result of the fixed point multiply would not overflow.
Int64 BigNumber = 9223372036854775807;
String BigNumberStr = BigNumber.ToString();
int Sum = 0;
foreach (Char c in BigNumberStr)
Sum += (byte)c;
// 48 is ascii value of zero
// remove in one step rather than in the loop
Sum -= 48 * BigNumberStr.Length;
Instead of int.Parse, why not subtract '0' from each character to get the actual value?
Remember, '9' - '0' = 9, so you should be able to do this in order k (length of the number). The subtraction is just one operation, so that should not slow things down.
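Applied to the LINQ one-liner from the question, that suggestion might look like this:
// Sum the digits by subtracting '0' from each character (requires System.Linq).
int sum = num.ToString().Sum(c => c - '0');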

Simple Pseudo-Random Algorithm

I need a pseudo-random generator which takes a number as input and returns another number which is reproducible and seems to be random.
Each input number should map to exactly one output number and vice versa
same input numbers always result in same output numbers
sequential input numbers that are close together (e.g. 1 and 2) should produce completely different output numbers (e.g. 1 => 9783526, 2 => 283)
It must not be perfect, it's just to create random but reproducible test data.
I use C#.
I wrote this funny piece of code some time ago which produced something random.
public static long Scramble(long number, long max)
{
    // some random values
    long[] scramblers = { 3, 5, 7, 31, 343, 2348, 89897 };
    number += (max / 7) + 6;
    number %= max;
    // shuffle according to divisibility
    foreach (long scrambler in scramblers)
    {
        if (scrambler >= max / 3) break;
        number = ((number * scrambler) % max)
            + ((number * scrambler) / max);
    }
    return number % max;
}
I would like to have something better, more reliable, working with any size of number (no max argument).
Could this probably be solved using a CRC algorithm? Or some bit shuffling stuff.
I removed the Microsoft code from this answer; the GNU code file is a lot longer, but basically it contains this, from http://cs.uccs.edu/~cs591/bufferOverflow/glibc-2.2.4/stdlib/random_r.c :
int32_t val = state[0];
val = ((state[0] * 1103515245) + 12345) & 0x7fffffff;
state[0] = val;
*result = val;
for your purpose, the seed is state[0] so it would look more like
int getRand(int val)
{
    return ((val * 1103515245) + 12345) & 0x7fffffff;
}
You (maybe) can do this easily in C# using the Random class:
public int GetPseudoRandomNumber(int input)
{
    Random random = new Random(input);
    return random.Next();
}
Since you're explicitly seeding Random with the input, you will get the same output every time given the same input value.
A tausworthe generator is simple to implement and pretty fast. The following pseudocode implementation has full cycle (2**31 - 1, because zero is a fixed point):
def tausworthe(seed)
    seed ^= seed >> 13
    seed ^= seed << 18
    return seed & 0x7fffffff
I don't know C#, but I'm assuming it has XOR (^) and bit shift (<<, >>) operators as in C.
Set an initial seed value, and invoke with seed = tausworthe(seed).
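A direct C# translation of that pseudocode might look like this (a sketch):
// Sketch: C# translation of the pseudocode above.
static uint Tausworthe(uint seed)
{
    seed ^= seed >> 13;
    seed ^= seed << 18;
    return seed & 0x7fffffff;
}

// Usage: keep feeding the result back in as the next seed.
// uint seed = 12345;
// seed = Tausworthe(seed);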
The first two rules suggest a fixed or input-seeded permutation of the input, but the third rule requires a further transform.
Is there any further restriction on what the outputs should be, to guide that transform? E.g. is there a set of output values to choose from?
If the only guide is "no max", I'd use the following...
Apply a hash algorithm to the whole input to get the first output item. A CRC might work, but for more "random" results, use a crypto hash algorithm such as MD5.
Use a next permutation algorithm (plenty of links on Google) on the input.
Repeat the hash-then-next-permutation until all required outputs are found.
The next permutation may be overkill though, you could probably just increment the first input (and maybe, on overflow, increment the second and so on) before redoing the hash.
For crypto-style hashing, you'll need a key - just derive something from the input before you start.
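A minimal sketch of the hash-based idea (MD5 is used only because the answer mentions it; truncating the digest to the first 8 bytes is my own simplification):
using System;
using System.Security.Cryptography;

// Sketch: derive a reproducible, random-looking value from an input number
// by hashing it. Close inputs (1, 2, ...) yield very different outputs.
static long Scramble(long input)
{
    using (var md5 = MD5.Create())
    {
        byte[] hash = md5.ComputeHash(BitConverter.GetBytes(input));
        return BitConverter.ToInt64(hash, 0);   // take the first 8 bytes of the digest
    }
}
Note that this mapping is not guaranteed to be one-to-one, which the question's first rule asks for; that is what the permutation step described above is meant to handle.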
