Expressing a subset in binary - c#

Given a list of 256 numbers in order (0-255), I want to express a subset of 128 numbers from that list. Every number will be unique and not repeated.
What is the most compact way to express this subset?
What I've come up with so far is having a 256 length bit-array and setting the appropriate indexes to 1. This method obviously requires 256 bits to represent the 128 values but is there a different, more space-saving way?
Thanks!

There are 256! / (128! * (256 - 128)!) unique combinations of 128 elements from a set of 256 items, when order does not matter (see the Wikipedia article on combinations).
If you calculate that number and take its base-2 logarithm, you will find that it is roughly 251.67. That means you need at least 252 bits to represent a unique selection of 128 items out of 256. Since .NET cannot address individual bits anyway (only whole bytes), there is little reason to work out how such an encoding could actually be done.
128 is the worst case in that regard. If you were selecting, say, 5 elements (or 251) out of 256, that could be represented with 34 bits, and it would have been worth trying to find that kind of compact representation.
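To sanity-check those figures, here is a small sketch (illustrative only) that sums logarithms so the enormous factorials never need to be computed directly:
static double BitsForSubset(int n, int k)
{
    // log2(C(n, k)) computed as a sum of log2 terms, avoiding huge factorials.
    double bits = 0;
    for (int i = 0; i < k; i++)
        bits += Math.Log((n - i) / (double)(k - i), 2);
    return bits;
}
// BitsForSubset(256, 128) is about 251.67 -> 252 bits after rounding up.
// BitsForSubset(256, 5) is about 33.04 -> 34 bits, matching the figures above.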

Since you don't care about the order of the subset nor do you care about restoring each element to its position in the original array, this is simply a case of producing a random subset of an array, which is similar to drawing cards from a deck.
To take unique elements from an array, you can simply shuffle the source array and then take a number of elements at the first X indices:
int[] srcArray = Enumerable.Range(0, 256).ToArray();
Random r = new Random();
var subset = srcArray.OrderBy(i => r.Next()).Take(128).ToArray();
Note: I use the above randomizing method to keep the example concise. For a more robust shuffling approach, I recommend the Fisher-Yates algorithm as described in this post.
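For reference, a minimal Fisher-Yates sketch (illustrative; not the code from the linked post):
// Fisher-Yates shuffle: each element ends up at a uniformly random position.
static void Shuffle(int[] array, Random rng)
{
    for (int i = array.Length - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1); // 0 <= j <= i
        int tmp = array[i];
        array[i] = array[j];
        array[j] = tmp;
    }
}
// Usage: shuffle, then take the first 128 elements as the subset.
int[] srcArray = Enumerable.Range(0, 256).ToArray();
Shuffle(srcArray, new Random());
var subset = srcArray.Take(128).ToArray();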

Related

Memory-efficient way to store/compare x amount of trinary (?) values in C#

I have a list of entities, and for the purpose of analysis, an entity can be in one of three states. Of course I wish it was only two states, then I could represent that with a bool.
In most cases there will be a list of entities where the size of the list is usually 100 < n < 500.
I am working on analyzing the effects of the combinations of the entities and the states.
So if I have 1 entity, then I can have 3 combinations. If I have two entities, I can have 9 combinations (3^2), and so on.
Because of the number of combinations, brute-forcing this is impractical (it needs to run on a single system). My task is to find good-but-not-necessarily-optimal solutions that could work. I don't need to test all possible permutations, I just need to find one that works. That is an implementation detail.
What I do need to do is to register the combinations possible for my current data set - this is basically to avoid duplicating the work of analyzing each combination. Every time a process arrives at a certain configuration of combinations, it needs to check if that combo is already being worked on or if it was resolved in the past.
So if I have x tri-state values, what is an efficient way of storing and comparing them in memory? I realize there will be limitations here. I am just trying to be as efficient as possible.
I can't think of a more effective unit of storage than two bits, where one of the four "bit states" is not used. But I don't know how to make this efficient. Do I need to choose between optimizing for storage size and optimizing for performance?
How can something like this be modeled in C# in a way that wastes the least amount of resources and still performs relatively well when a process needs to ask "Has this particular combination of tri-state values already been tested?"?
Edit: As an example, say I have just 3 entities, and the state is represented by a simple integer, 1, 2 or 3. We would then have this list of combinations:
111
112
113
121
122
123
131
132
133
211
212
213
221
222
223
231
232
233
311
312
313
321
322
323
331
332
333
I think you can break this down as follows:
You have a set of N entities, each of which can have one of three different states.
Given one particular permutation of states for those N entities, you want to remember that you have processed that permutation.
It therefore seems that you can treat the N entities as a base-3 number with N digits.
When considering one particular set of states for the N entities, you can store that as an array of N bytes where each byte can have the value 0, 1 or 2, corresponding to the three possible states.
That isn't a memory-efficient way of storing the states for one particular permutation, but that's OK because you don't need to store that array. You just need to store a single bit somewhere corresponding to that permutation.
So what you can do is to convert the byte array into a base 10 number that you can use as an index into a BitArray. You then use the BitArray to remember whether a particular permutation of states has been processed.
To convert a byte array representing a base three number to a decimal number, you can use this code:
public static int ToBase10(byte[] entityStates) // Each state can be 0, 1 or 2.
{
    int result = 0;
    for (int i = 0, n = 1; i < entityStates.Length; n *= 3, ++i)
        result += n * entityStates[i];
    return result;
}
Given that you have numEntities different entities, you can then create a BitArray like so:
int numEntities = 4;
int numPerms = (int)Math.Pow(3, numEntities); // 3 states per entity => 3^numEntities permutations
BitArray states = new BitArray(numPerms);
Then states can store a bit for each possible permutation of states for all the entities.
Let's suppose that you have 4 entities A, B, C and D, and you have a permutation of states (which will be 0, 1 or 2) as follows: A2 B1 C0 D1. That is, entity A has state 2, B has state 1, C has state 0 and D has state 1.
You would represent that as a byte array like so:
byte[] permutation = { 2, 1, 0, 1 };
Then you can convert that to a base 10 number like so:
int asBase10 = ToBase10(permutation);
Then you can check if that permutation has been processed like so:
if (!states[asBase10])
{
    // Not processed, so process it.
    process(permutation);
    states[asBase10] = true; // Remember that we processed it.
}
Without getting overly fancy with algorithms and data structures, and assuming your tri-state values can be represented as strings and don't have an easily determined fixed maximum count, i.e. "111", "112", etc. (or even "1:1:1", "1:1:2"), then a simple SortedSet may end up being fairly efficient.
As a bonus, it doesn't care about the number of values in your set.
SortedSet<string> alreadyTried = new SortedSet<string>();

if (!HasSetBeenTried("1:1:1"))
{
    // do whatever
}

if (!HasSetBeenTried("500:212:100"))
{
    // do whatever
}

public bool HasSetBeenTried(string set)
{
    if (alreadyTried.Contains(set)) return true; // seen before
    alreadyTried.Add(set);                       // first time: remember it
    return false;
}
Simple math says:
3 entities with 3 states each make 27 combinations.
So you need exactly log(27)/log(2) = ~4.75 bits to store that information.
Because a PC can only make use of whole bits, you need to "waste" ~0.25 bits and use 5 bits per combination.
The more data you gather, the better you can pack that information, but in the end maybe a compression algorithm could help even more.
Again: you only asked for memory efficiency, not performance.
In general you can calculate the bits you need with Math.Ceiling(Math.Log(noCombinations, 2)).
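For instance, a quick sketch of that formula (illustrative):
int numEntities = 3;
double combinations = Math.Pow(3, numEntities);                // 27 combinations for 3 tri-state entities
int bitsNeeded = (int)Math.Ceiling(Math.Log(combinations, 2)); // 5 bits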

Constructing Hash Function for integer array [duplicate]

This question already has answers here: C# hashcode for array of ints (9 answers). Closed 9 years ago.
I have an array of ints and I want to create a hash function for it, so that two integer arrays with different elements have only a low probability of producing the same hash value. What is the best way to do that?
The length of array could be up to 500, the integer number could be from 0 to 50.
Note that this is not an exact duplicate of the linked question, as the nature of the integer array (length and range of numbers) is different.
I used this before:
public int GetHashCode(int[] data)
{
    if (data == null)
        return 0;
    int result = 17;
    foreach (var value in data)
    {
        result += result * 23 + value;
    }
    return result;
}
but I discovered it has many collisions.
What I want to solve is to construct a Dictionary<int[], string>, so integer arrays with different values should result in different hash codes as far as possible.
two integer arrays with different elements do not result in the same hash values
This is not possible for arrays with more than one element. An array with N elements has 32*N bits of information, you cannot map it to the 32 bits of the hash code without losing some information, unless N=1.
For N>1 there will be a very large number of array pairs for which the hash code is the same, while the arrays are different. There are techniques that make it less likely that a pair of arrays chosen at random would have the same hash code, but it is not possible to eliminate collisions completely for the general case.
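Incidentally, the posted code accumulates with +=, so each step effectively computes result = result * 24 + value; the more usual pattern multiplies and then adds, inside an unchecked block so overflow simply wraps. A sketch of that pattern (my own illustration; it reduces but cannot eliminate collisions):
public int GetHashCode(int[] data)
{
    if (data == null)
        return 0;
    unchecked // allow the multiplication to overflow and wrap around
    {
        int result = 17;
        foreach (var value in data)
        {
            result = result * 31 + value; // multiply-then-add combine
        }
        return result;
    }
}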
The length of array could be up to 500, the integer number could be from 0 to 50
You need approximately 2500 bits to represent an array like that; your hash value has only 32 bits, so you will have lots of hash collisions as well. You can do a perfect hash for arrays of zero to five elements with values 0..50 by packing the numbers in an int (use value 51 to represent "a missing value" so that you can pack arrays of different length). Once you need to add the sixth number to the mix, your hash would no longer be perfect.
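For what it's worth, a sketch of that packing idea (my own illustration; it assumes at most five elements with values 0..50 and uses 51 as the "missing value" marker):
// Perfect hash for arrays of 0..5 elements with values 0..50.
// 52 symbols per position (0..50 plus 51 = "missing"), and 52^5 < 2^31,
// so the packed value fits in a non-negative int. A sixth element would not fit.
public static int PackPerfect(int[] data) // data.Length <= 5, values 0..50
{
    int packed = 0;
    for (int i = 0; i < 5; i++)
    {
        int symbol = i < data.Length ? data[i] : 51; // 51 marks "no value here"
        packed = packed * 52 + symbol;
    }
    return packed;
}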
500 values from 0 to 50 means you can store the sum of all values, each multiplied by 50 and by its position (starting from 0); this can also be reversed to extrapolate the values.
Just check for the array length plus this hash, and you should never find a collision.

Which part of a GUID is most worth keeping?

I need to generate a unique ID and was considering Guid.NewGuid to do this, which generates something of the form:
0fe66778-c4a8-4f93-9bda-366224df6f11
This is a little long for the string-type database column that it will end up residing in, so I was planning on truncating it.
The question is: Is one end of a GUID more preferable than the rest in terms of uniqueness? Should I be lopping off the start, the end, or removing parts from the middle? Or does it just not matter?
You can save space by using a base64 string instead:
var g = Guid.NewGuid();
var s = Convert.ToBase64String(g.ToByteArray());
Console.WriteLine(g);
Console.WriteLine(s);
This will save you 12 characters (8 if you weren't using the hyphens).
Keep all of it.
From the above link:
* Four bits to encode the computer number,
* 56 bits for the timestamp, and
* four bits as a uniquifier.
You can redefine the Guid to right-size it to your needs.
If the GUID were simply a random number, you could keep an arbitrary subset of the bits and suffer a certain percent chance of collision that you can calculate with the "birthday algorithm":
double numBirthdays = 365; // set to e.g. 18446744073709551616d for 64 bits
double numPeople = 23; // set to the maximum number of GUIDs you intend to store
double probability = 1; // that all birthdays are different
for (int x = 1; x < numPeople; x++)
probability *= (double)(numBirthdays - x) / numBirthdays;
Console.WriteLine("Probability that two people have the same birthday:");
Console.WriteLine((1 - probability).ToString());
However, often the probability of a collision is higher because, as a matter of fact, GUIDs are in general NOT random. According to Wikipedia's GUID article there are five types of GUIDs. The 13th digit specifies which kind of GUID you have, so it tends not to vary much, and the top two bits of the 17th digit are always fixed at 10.
For each type of GUID you'll get different degrees of randomness. Version 4 (13th digit = 4) is entirely random except for digits 13 and 17; versions 3 and 5 are effectively random, as they are cryptographic hashes; while versions 1 and 2 are mostly NOT random but certain parts are fairly random in practical cases. A "gotcha" for version 1 and 2 GUIDs is that many GUIDs could come from the same machine and in that case will have a large number of identical bits (in particular, the last 48 bits and many of the time bits will be identical). Or, if many GUIDs were created at the same time on different machines, you could have collisions between the time bits. So, good luck safely truncating that.
I had a situation where my software only supported 64 bits for unique IDs so I couldn't use GUIDs directly. Luckily all of the GUIDs were type 4, so I could get 64 bits that were random or nearly random. I had two million records to store, and the birthday algorithm indicated that the probability of a collision was 1.08420141198273 x 10^-07 for 64 bits and 0.007 (0.7%) for 48 bits. This should be assumed to be the best-case scenario, since a decrease in randomness will usually increase the probability of collision.
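A minimal sketch of that kind of extraction (illustrative; the byte offset is just one reasonable choice):
Guid g = Guid.NewGuid();
byte[] bytes = g.ToByteArray();
// Bytes 8..15 hold the last 64 bits of the GUID; for a version-4 GUID the top
// two bits of byte 8 are the fixed variant bits, the remaining 62 bits are random.
long id64 = BitConverter.ToInt64(bytes, 8);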
I suppose that in theory, more GUID types could exist in the future than are defined now, so a future-proof truncation algorithm is not possible.
I agree with Rob - Keep all of it.
But since you said you're going into a database, I thought I'd point out that just using Guids doesn't necessarily mean that they will index well in a database. For that reason, the NHibernate developers created a Guid.Comb algorithm that's more DB friendly.
See NHibernate POID Generators revealed and documentation on the Guid Algorithms for more information.
NOTE: Guid.Comb is designed to improve performance on MsSQL
Truncating a GUID is a bad idea, please see this article for why.
You should consider generating a shorter GUID, as a Google search reveals some solutions. These solutions seem to involve taking a GUID and re-encoding the same 128 bits with a much larger character set (up to the full 255-character extended ASCII range), producing a shorter string.

Probability of repeating results using rand.Next()

Looking at another question of mine, I realized that technically there is nothing preventing this algorithm from running for an infinite period of time (i.e. it never returns), because rand.Next(1, 100000) could theoretically keep generating values that are already in the list.
Out of curiosity; how would I calculate the probability of this happening? I assume it would be very small?
Code from other question:
Random rand = new Random();
List<Int32> result = new List<Int32>();
for (Int32 i = 0; i < 300; i++)
{
    Int32 curValue = rand.Next(1, 100000);
    while (result.Exists(value => value == curValue))
    {
        curValue = rand.Next(1, 100000);
    }
    result.Add(curValue);
}
On ONE given draw of a random number, the probability of repeating a value already in the result list is
P(Collision) = i * 1/100000, where i is the number of values in the list.
That is because all 100,000 possible numbers are assumed to have the same probability of being drawn (assumption of a uniform distribution) and the drawing of any number is independent from that of drawing any other number. (Strictly speaking, rand.Next(1, 100000) can return only 99,999 distinct values because the upper bound is exclusive, but using 100,000 keeps the arithmetic simple and barely changes the results.)
The probability of experiencing such a "collision" with the numbers from the list several times in a row is
P(n Collisions) = P(Collision) ^ n
where n is the number of times a collision happens
That is because the drawings are independent.
Numerically...
when the list is half full, i = 150 and
P(Collision) = 0.15% = 0.0015 and
P(2 Collisions) = 0.00000225
P(3 Collisions) = 0.000000003375
P(4 Collisions) = 0.0000000000050625
when the list is all full but for the last one, i = 299 and
P(Collision) = 0.299% = 0.00299 and
P(2 Collisions) = 0.0000089401 (approx)
P(3 Collisions) = 0.00000002673 (approx)
P(4 Collisions) = 0.000000000079925 (approx)
You are therefore right to assume that the probability of having to draw multiple times for finding the next suitable value to add to the array is very small, and should therefore not impact the overall performance of the snippet. Beware that there will be a few retries (statistically speaking), but the total number of retries will be small compared to 300.
If however the total number of items desired in the list were to increase significantly, or if the range of random numbers were reduced, P(Collision) would not be so small and the number of "retries" needed would grow accordingly. That is why other algorithms exist for drawing multiple values without replacement; most are based on the idea of using the random number as an index into an array of all the remaining values.
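As an illustration of that idea (a sketch, not production code): keep a pool of the remaining values, pick a random index into the still-unused portion, and swap the last unused value into the hole, so no retries are ever needed:
// Draw 300 distinct values from 1..99999 without retrying.
Random rand = new Random();
int[] pool = Enumerable.Range(1, 99999).ToArray();
List<int> result = new List<int>(300);
int remaining = pool.Length;
for (int i = 0; i < 300; i++)
{
    int j = rand.Next(remaining);    // index into the still-unused values
    result.Add(pool[j]);
    pool[j] = pool[--remaining];     // move the last unused value into the hole
}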
Assuming a uniform distribution (not a bad assumption, I believe) the chance of getting the number n times in a row is (0.00001)^n.
It's quite possible for a PRNG to generate the same number in a limited range in consecutive calls. The probability would be a function of the bit-size of the raw PRNG and the method used to reduce that size to the numeric range you want (in this case 1 - 100000).
To answer your question exactly, no, it isn't very small, the probability of it going on for an infinite period of time "is" 0. I say "is" because it actually tends to 0 when the number of iterations tends to infinity.
As bdares said, it will tend to 0 with (1/range)^n, with n being the number of iterations, if we can assume a uniform distribution (this says we kinda can).
This program will not halt if:
A random number is picked that is in the result set
That number generates a cycle (i.e. a loop) in the random number generator's algorithm (they all do)
All numbers in the loop are already in the result set
All random number generators eventually loop back on themselves, due to the limited number of integers possible ==> for 32-bit, only 2^32 possible values.
"Good" generators have very large loops. "Poor" algorithms yield short loops for certain values. Consult Knuth's The Art of Computer Programming for random number generators. It is a fascinating read.
Now, assuming there is a cycle of (n) numbers. For your program, which loops 300 times, that means (n) <= 300. Also, the number of attempts you try before you hit on a number in this cycle, plus the length of the cycle, must not be greater than 300. Therefore, assuming the first try you hit on the cycle, then the cycle can be 300 long. If on the second try you hit the cycle, it can only be 299 long.
Assuming that most random number generation algorithms have reasonably-flat probability distribution, the probability of hitting a 300-cycle the first time is (300/2^32), multiplied by the probability of having a 300-cycle (this depends on the rand algorithm), plus the probability of hitting a 299-cycle the first time (299/2^32) x probability of having a 299-cycle, etc. And so on and so forth. Then add up the second try, third try, all the way up to the 300-th try (which can only be a 1-cycle).
Now this is assuming that any number can take on the full 2^32 generator space. If you are limiting it to 100000 only, then in essence you increase the chance of having much shorter cycles, because multiple numbers (in the 2^32 space) can map to the same number in "real" 100000 space.
In reality, most random generator algorithms have minimum cycle lengths of > 300. A random generator implementation based on the simplest LCG (linear congruential generator, wikipedia) can have a "full period" (i.e. 2^32) with the correct choice of parameters. So it is safe to say that minimum cycle lengths are definitely > 300. If this is the case, then it depends on the mapping algorithm of the generator to map 2^32 numbers into 100000 numbers. Good mappers will not create 300-cycles, poor mappers may create short cycles.

Hashtable/Dictionary collisions

Using the standard English letters and underscore only, how many characters can be used at a maximum without causing a potential collision in a hashtable/dictionary?
So strings like:
blur
Blur
b
Blur_The_Shades_Slightly_With_A_Tint_Of_Blue
...
There's no guarantee that you won't get a collision between single letters.
You probably won't, but the algorithm used in string.GetHashCode isn't specified, and could change. (In particular it changed between .NET 1.1 and .NET 2.0, which burned people who assumed it wouldn't change.)
Note that hash code collisions won't stop well-designed hashtables from working - you should still be able to get the right values out, it'll just potentially need to check more than one key using equality if they've got the same hash code.
Any dictionary which relies on hash codes being unique is missing important information about hash codes, IMO :) (Unless it's operating under very specific conditions where it absolutely knows they'll be unique, i.e. it's using a perfect hash function.)
Given a perfect hashing function (which you're not typically going to have, as others have mentioned), you can find the maximum possible number of characters that guarantees no two strings will produce a collision, as follows:
No. of unique hash codes available = 2 ^ 32 = 4294967296 (assuming a 32-bit integer is used for hash codes)
Size of character set = 2 * 26 + 1 = 53 (26 lower-case plus 26 upper-case letters in the Latin alphabet, plus underscore)
Then you must consider that a string of length l (or less) has a total of 54 ^ l representations. Note that the base is 54 rather than 53 because the string can terminate after any character, adding an extra possibility per char - not that it greatly affects the result.
Taking the no. of unique hash codes as your maximum number of string representations, you get the following simple equation:
54 ^ l = 2 ^ 32
And solving it:
log2 (54 ^ l) = 32
l * log2 54 = 32
l = 32 / log2 54 = 5.56
(Where log2 is the logarithm function of base 2.)
Since string lengths clearly can't be fractional, you take the integral part to give a maximum length of just 5. Very short indeed, but observe that this restriction would prevent even the remotest chance of a collision given a perfect hash function.
This is largely theoretical, however, as I've mentioned, and I'm not sure how much use it might be in the design consideration of anything. That said, hopefully it should help you understand the matter from a theoretical viewpoint, on top of which you can add the practical considerations (e.g. non-perfect hash functions, non-uniformity of distribution).
Universal Hashing
To calculate the probability of collisions for S strings of length L, with W bits per character, hashed to a hash of length H bits, assuming an optimal universal hash (1), you can calculate the collision probability based on a hash table of size (number of buckets) N.
First things first, we can assume an ideal hashtable implementation (2) that splits the H bits of the hash perfectly across the available buckets N (3). This means H becomes meaningless except as a limit on N.
W and L are simply the basis for an upper bound on S. For simpler maths, assume that strings shorter than L are padded to L with a special null character. If we are interested in the worst case, this is 54^L (26*2 + '_' + null); plainly this is a ludicrous number, and the actual number of entries is more useful than the character set and the length, so we will simply treat S as a variable in its own right.
We are left trying to put S items into N buckets.
This then becomes a very well known problem: the birthday paradox.
Solving this for various probabilities and numbers of buckets is instructive, but assuming we have 1 billion buckets (so about 4GB of memory in a 32-bit system), we would need only about 37K entries before we hit a 50% chance of there being at least one collision. Given that, trying to avoid any collisions in a hashtable becomes plainly absurd.
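That figure can be reproduced with the standard square-root approximation of the birthday bound (a rough sketch):
// Entries needed for roughly a 50% chance of at least one collision,
// using n ~ sqrt(2 * ln 2 * N), about 1.1774 * sqrt(N).
double buckets = 1e9;
double entriesFor50PercentCollision = Math.Sqrt(2 * Math.Log(2) * buckets); // about 37,200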
All this does not mean that we should not care about the behaviour of our hash functions. Clearly these numbers assume ideal implementations; they are an upper bound on how good we can get. A poor hash function can give far worse collisions in some areas, or waste some of the possible "space" by never or rarely using it, all of which can make hashing less than optimal and even degrade to performance that looks like a list, but with much worse constant factors.
The .NET framework's implementation of the string's hash function is not great (in that it could be better) but is probably acceptable for the vast majority of users and is reasonably efficient to calculate.
An Alternative Approach: Perfect Hashing
If you wish, you can generate what are known as perfect hashes. This requires full knowledge of the input values in advance, however, so it is not often useful. In a similar vein to the maths above, we can show that even perfect hashing has its limits:
Recall the limit of 54 ^ L strings of length L. However, we only have H bits (we shall assume 32), which is about 4 billion different numbers. So if you can have truly any string, and any number of them, then you have to satisfy:
54 ^ L <= 2 ^ 32
And solving it:
log2 (54 ^ L) <= 32
L * log2 54 <= 32
L <= 32 / log2 54 = 5.56
Since string lengths clearly can't be fractional, you are left with a maximum length of just 5. Very short indeed.
If you know that you will only ever have a set of strings well below 4 billion in size, then perfect hashing would let you handle any value of L, but restricting the set of values can be very hard in practice; you must know them all in advance, or degrade to what amounts to a database of string -> hash and add to it as new strings are encountered.
For this exercise the universal hash is optimal as we wish to reduce the probability of any collision i.e. for any input the probability of it having output x from a set of possibilities R is 1/R.
Note that doing an optimal job on the hashing (and the internal bucketing) is quite hard but that you should expect the built in types to be reasonable if not always ideal.
In this example I have avoided the question of closed and open addressing. This does have some bearing on the probabilities involved, but not significantly.
A hash algorithm isn't supposed to guarantee uniqueness. Given that there are far more potential strings (26^n for length n, even ignoring special chars, spaces, capitalization, non-English chars, etc.) than there are places in your hashtable, there's no way such a guarantee could be fulfilled. It's only supposed to guarantee a good distribution.
If your key is a string (e.g., in a Dictionary) then its GetHashCode() will be used. That's a 32-bit integer. Hashtable defaults to a load factor of 1.0 (a 1:1 ratio of keys to buckets) and increases the number of buckets to maintain that load factor. So if you do see collisions, they should tend to occur around reallocation boundaries (and decrease shortly after a reallocation).
