Constructing Hash Function for integer array [duplicate] - c#

This question already has answers here:
C# hashcode for array of ints
(9 answers)
Closed 9 years ago.
I have an array of ints and I want to create a hash function for it, so that two integer arrays with different elements have only a low probability of producing the same hash value. What is the best way to do that?
The length of the array could be up to 500, and each integer can be from 0 to 50.
Note that this is not an exact duplicate of the linked question, as the nature of the integer array (its length and the range of its numbers) is different.
I used this before:
public int GetHashCode(int[] data)
{
    if (data == null)
        return 0;

    int result = 17;
    foreach (var value in data)
    {
        result += result * 23 + value;
    }
    return result;
}
but I discovered it has many collisions.
What I want to solve is to construct a Dictionary<int[], string>, so integer arrays with different values should result in different hash codes.

two integer arrays with different elements do not result in the same hash values
This is not possible for arrays with more than one element. An array with N elements has 32*N bits of information; you cannot map it to the 32 bits of the hash code without losing some information, unless N=1.
For N>1 there will be a very large number of array pairs for which the hash code is the same, while the arrays are different. There are techniques that make it less likely that a pair of arrays chosen at random would have the same hash code, but it is not possible to eliminate collisions completely for the general case.
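One common collision-reducing technique (shown only as a sketch of the general idea, not a way to eliminate collisions) is the multiply-by-a-prime rolling combiner, with the running hash multiplied rather than accumulated with +=, and wrapped in unchecked so overflow simply wraps around:
public int GetHashCode(int[] data)
{
    if (data == null)
        return 0;

    unchecked // allow the arithmetic to overflow and wrap around
    {
        int hash = 17;
        foreach (var value in data)
        {
            hash = hash * 31 + value; // multiply by a prime, then mix in the element
        }
        return hash;
    }
}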
The length of array could be up to 500, the integer number could be from 0 to 50
You need roughly 2,800 bits (500 × log2(51)) to represent an array like that; your hash value has only 32 bits, so you will have lots of hash collisions as well. You can do a perfect hash for arrays of zero to five elements with values 0..50 by packing the numbers into an int (use the value 51 to represent "a missing value" so that you can pack arrays of different lengths). Once you need to add the sixth number to the mix, your hash will no longer be perfect.
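A sketch of that perfect hash for the up-to-five-element case might look like this (base 52: the values 0..50 plus 51 as the "missing" marker; 52^5 = 380,204,032 still fits in an int). This is only an illustration of the packing idea described above:
public static int PerfectHash(int[] data)
{
    if (data == null || data.Length > 5)
        throw new ArgumentException("Only arrays of 0..5 elements can be hashed perfectly this way.");

    int result = 0;
    for (int i = 0; i < 5; i++)
    {
        int digit = i < data.Length ? data[i] : 51; // 51 marks a missing element
        result = result * 52 + digit;               // treat the array as a base-52 number
    }
    return result;
}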

500 values from 0 to 50 means you can store the sum of all values, each multiplied by 50 and by its position (starting from 0); this can also be reversed to extrapolate the values.
Just check the array length and this hash, and you should never find a collision.

Related

Expressing a subset in binary

Given a list of 256 numbers in order (0-255), I want to express a subset of 128 numbers from that list. Every number will be unique and not repeated.
What is the most compact way to express this subset?
What I've come up with so far is having a 256 length bit-array and setting the appropriate indexes to 1. This method obviously requires 256 bits to represent the 128 values but is there a different, more space-saving way?
Thanks!
There are 256! / (128! * (256 - 128)!) unique combinations of 128 elements from a set of 256 items, when order does not matter (see the wiki article about combinations).
If you calculate that number and take the base-2 logarithm, you will find that it's about 251.6. That means you need at least 252 bits to represent a unique selection of 128 items out of 256. Since .NET cannot address individual bits anyway (only whole bytes), there is little reason to work out exactly how this could be done.
128 is the worst number in that regard. If you were selecting, say, 5 elements (or 251) out of 256, that could be represented with 34 bits, and it would have been worthwhile to try and find that kind of compact representation.
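If you want to verify those bit counts yourself, here is a small sketch (the helper name is mine) that computes ceil(log2(C(n, k))) from a sum of logarithms, so the huge combination count is never materialised:
static int BitsForSubset(int n, int k)
{
    // log2(C(n, k)) = sum over i = 1..k of log2((n - k + i) / i)
    double bits = 0;
    for (int i = 1; i <= k; i++)
        bits += Math.Log(n - k + i, 2) - Math.Log(i, 2);
    return (int)Math.Ceiling(bits);
}

// BitsForSubset(256, 128) returns 252; BitsForSubset(256, 5) returns 34.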
Since you don't care about the order of the subset nor do you care about restoring each element to its position in the original array, this is simply a case of producing a random subset of an array, which is similar to drawing cards from a deck.
To take unique elements from an array, you can simply shuffle the source array and then take a number of elements at the first X indices:
int[] srcArray = Enumerable.Range(0, 256).ToArray();
Random r = new Random();
var subset = srcArray.OrderBy(i => r.Next()).Take(128).ToArray();
Note: I use the above randomizing method to keep the example concise. For a more robust shuffling approach, I recommend the Fisher-Yates algorithm as described in this post.
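For reference, a minimal Fisher-Yates sketch (a generic helper of my own naming) that could replace the OrderBy call above:
static void Shuffle<T>(T[] array, Random r)
{
    // Walk the array backwards, swapping each slot with a random earlier (or same) slot.
    for (int i = array.Length - 1; i > 0; i--)
    {
        int j = r.Next(i + 1); // 0 <= j <= i
        T tmp = array[i];
        array[i] = array[j];
        array[j] = tmp;
    }
}

// Usage: shuffle the source array in place, then take the first 128 items as the subset.
// Shuffle(srcArray, r);
// var subset = srcArray.Take(128).ToArray();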

Memory-efficient way to store/compare x amount of trinary (?) values in C#

I have a list of entities, and for the purpose of analysis, an entity can be in one of three states. Of course I wish it was only two states, then I could represent that with a bool.
In most cases there will be a list of entities where the size of the list is usually 100 < n < 500.
I am working on analyzing the effects of the combinations of the entities and the states.
So if I have 1 entity, then I can have 3 combinations. If I have two entities, I can have nine combinations, and so on.
Because of the number of combinations, brute-forcing this would be impractical (it needs to run on a single system). My task is to find good-but-not-necessarily-optimal solutions that could work. I don't need to test all possible permutations, I just need to find one that works. That is an implementation detail.
What I do need to do is to register the combinations possible for my current data set - this is basically to avoid duplicating the work of analyzing each combination. Every time a process arrives at a certain configuration of combinations, it needs to check if that combo is already being worked at or if it was resolved in the past.
So if I have x amount of tri-state values, what is an efficient way of storing and comparing this in memory? I realize there will be limitations here. Just trying to be as efficient as possible.
I can't think of a more effective unit of storage than two bits, where one of the four "bit states" goes unused. But I don't know how to make this efficient. Do I need to choose between optimizing for storage size and optimizing for performance?
How can something like this be modeled in C# in a way that wastes the least amount of resources and still performs relatively well when a process needs to ask "Has this particular combination of tri-state values already been tested?"?
Edit: As an example, say I have just 3 entities, and the state is represented by a simple integer, 1, 2 or 3. We would then have this list of combinations:
111
112
113
121
122
123
131
132
133
211
212
213
221
222
223
231
232
233
311
312
313
321
322
323
331
332
333
I think you can break this down as follows:
You have a set of N entities, each of which can have one of three different states.
Given one particular permutation of states for those N entities, you want to remember that you have processed that permutation.
It therefore seems that you can treat the N entities as a base-3 number with N digits.
When considering one particular set of states for the N entities, you can store that as an array of N bytes where each byte can have the value 0, 1 or 2, corresponding to the three possible states.
That isn't a memory-efficient way of storing the states for one particular permutation, but that's OK because you don't need to store that array. You just need to store a single bit somewhere corresponding to that permutation.
So what you can do is to convert the byte array into a base 10 number that you can use as an index into a BitArray. You then use the BitArray to remember whether a particular permutation of states has been processed.
To convert a byte array representing a base three number to a decimal number, you can use this code:
public static int ToBase10(byte[] entityStates) // Each state can be 0, 1 or 2.
{
    int result = 0;
    for (int i = 0, n = 1; i < entityStates.Length; n *= 3, ++i)
        result += n * entityStates[i];
    return result;
}
Given that you have numEntities different entities, you can then create a BitArray like so:
int numEntities = 4;
int numPerms = (int)Math.Pow(3, numEntities); // 3 states for each of the numEntities entities
BitArray states = new BitArray(numPerms);
Then states can store a bit for each possible permutation of states for all the entities.
Let's suppose that you have 4 entities A, B, C and D, and you have a permutation of states (which will be 0, 1 or 2) as follows: A2 B1 C0 D1. That is, entity A has state 2, B has state 1, C has state 0 and D has state 1.
You would represent that as a byte array like so:
byte[] permutation = { 2, 1, 0, 1 };
Then you can convert that to a base 10 number like so:
int asBase10 = ToBase10(permutation);
Then you can check if that permutation has been processed like so:
if (!states[asBase10])
{
    // Not processed, so process it.
    process(permutation);
    states[asBase10] = true; // Remember that we processed it.
}
Without getting overly fancy with algorithms and data structures, and assuming your tri-state values can be represented as strings and don't have an easily determined fixed maximum count, e.g. "111", "112", etc. (or even "1:1:1", "1:1:2"), a simple SortedSet may end up being fairly efficient.
As a bonus, it doesn't care about the number of values in your set.
SortedSet<string> alreadyTried = new SortedSet<string>();

if (!HasSetBeenTried("1:1:1"))
{
    // do whatever
}

if (!HasSetBeenTried("500:212:100"))
{
    // do whatever
}

public bool HasSetBeenTried(string set)
{
    if (alreadyTried.Contains(set))
        return true;           // seen before, no need to try it again

    alreadyTried.Add(set);     // first time we see it, remember it
    return false;
}
Simple mathematics says:
3 entities with 3 states each make 27 combinations.
So you need log(27)/log(2) ≈ 4.75 bits to store that information.
Because a PC can only make use of whole bits, you need to "waste" ~0.25 bits and use 5 bits per combination.
The more data you gather, the better you can pack that information, but in the end a compression algorithm could help even more.
Again: you only asked for memory efficiency, not performance.
In general you can calculate the bits you need with Math.Ceiling(Math.Log(noCombinations, 2)).
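As a quick illustration of that formula (variable names are mine):
int numEntities = 3;
double combinations = Math.Pow(3, numEntities);                // 27 combinations for 3 tri-state entities
int bitsNeeded = (int)Math.Ceiling(Math.Log(combinations, 2)); // 5 bits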

Lengths of cycles in random sequence

The following LINQPad code generates a random sequence of unique integers from 0 to N and calculates the length of the cycle for every integer starting from 0. To calculate the cycle length for a given integer, it reads the value from the boxes array at the index equal to that integer, then takes that value and reads from the index equal to it, and so on. The process stops when the value read from the array equals the original integer we started with. The number of iterations spent calculating the length of each cycle is saved into a Dictionary.
const int count = 100;
var random = new Random();
var boxes = Enumerable.Range(0, count).OrderBy(x => random.Next(0, count - 1)).ToArray();
string.Join(", ", boxes.Select(x => x.ToString())).Dump("Boxes");
var stats = Enumerable.Range(0, count).ToDictionary(x => x, x => {
    var iterations = 0;
    var ind = x;
    while (boxes[ind] != x)
    {
        ind = boxes[ind];
        iterations++;
    }
    return iterations;
});
stats.GroupBy(x => x.Value).Select(x => new {x.Key, Count = x.Count()}).OrderBy(x => x.Key).Dump("Stats");
stats.Sum(x => x.Value).Dump("Total Iterations");
The typical results I am getting seem weird to me:
The lengths of all cycles can be grouped into only few buckets (usually 3 to 7). I was hoping to see more distinct buckets.
The number of elements in each bucket usually grows together with the bucket value it belongs to. I was hoping that it would be more random.
I have tried several different randomize functions, like .NET's Random and RandomNumberGenerator classes, as well as random data generated from random.org. All of them seem to produce similar results.
Am I doing something wrong? Are those results expected from a mathematical point of view? Or, perhaps, does the pseudo-random nature of the functions I used have side effects?
What you are doing is generating a random permutation of size count. Then you check the properties of the permutation. If your random number generator is good, then you should observe the statistics of random permutations.
The average number of cycles of length k is 1/k, for k ≤ count. On average, there is 1 fixed point, 1/2 cycles of length 2, 1/3 cycles of length 3, etc. The average number of cycles of any length is therefore 1 + 1/2 + 1/3 + ... + 1/count ≈ ln(count) + γ. There are a lot of neat properties of the distribution of the number of cycles. Very occasionally there are many cycles, but the average value of 2^(number of cycles) is count + 1.
Your buckets correspond to the number of different cycle lengths, which is at most the number of cycles, but might be lower because of repeated cycle lengths. On average, few cycle lengths are repeated. Even as the count increases to infinity, and the average number of cycles increases to infinity, the average number of repeated cycle lengths stays finite.
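If you want to sanity-check the 1/k figure empirically, a small simulation along these lines (my own sketch, separate from the code in the question) reproduces it:
var rng = new Random();
const int n = 100;
const int trials = 10000;
var totals = new double[n + 1]; // totals[k] = number of cycles of length k seen so far

for (int t = 0; t < trials; t++)
{
    // Fisher-Yates shuffle of 0..n-1 gives a uniformly random permutation.
    var perm = Enumerable.Range(0, n).ToArray();
    for (int i = n - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);
        int tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
    }

    // Walk every cycle once and record its length.
    var seen = new bool[n];
    for (int start = 0; start < n; start++)
    {
        if (seen[start]) continue;
        int len = 0, cur = start;
        do { seen[cur] = true; cur = perm[cur]; len++; } while (cur != start);
        totals[len]++;
    }
}

for (int k = 1; k <= 5; k++)
    Console.WriteLine("length {0}: average {1:F3}, expected {2:F3}", k, totals[k] / trials, 1.0 / k);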
In a permutation test in statistics, usually given as an example of bootstrapping, you analyze some types of data by viewing them as a permutation. For example, you might observe two quantities, x_i and y_i. You get a permutation by sorting the xs and the ys, and seeing the index of the y value paired with the kth x value. Then you compare statistics of this permutation with the properties of random permutations. This doesn't assume much about the underlying distributions, but it can still detect when x and y seem to be related. So it's useful to know what to expect from random permutations.

Visual FoxPro (VFP) CTOBIN and BINTOC Functions - Equivalent In .Net

We are rewriting some applications previously developed in Visual FoxPro and redeveloping them using .Net ( using C# )
Here is our scenario:
Our application uses smartcards. We read in data from a smartcard which has a name and number. The name comes back OK in readable text, but the number, in this case '900', comes back as a 2-byte character representation (131 & 132) and looks like this: ƒ„
Those 2 special characters can be seen in the extended ASCII table. Now, as you can see, the 2 bytes are 131 and 132, and they can vary, as there is no single standard extended ASCII table (as far as I can tell from reading some of the posts on here).
So... the smart card was previously written to using the BINTOC function in VFP, and therefore the 900 was written to the card as ƒ„. Within FoxPro, those 2 special characters can be converted back into integer format using the CTOBIN function, another built-in function in FoxPro.
So (finally getting to the point): so far we have been unable to convert those 2 special characters back to an int (900), and we are wondering if it is possible in .NET to read the character representation of an integer back to an actual integer.
Or is there a way to rewrite the logic of those 2 VFP functions in C#?
UPDATE:
After some fiddling we realised that to get 900 into 2 bytes we need to convert 900 into a 16-bit binary value, then convert that 16-bit binary value back to a decimal value.
So, as above, we are receiving back 131 and 132, whose corresponding binary values are 10000011 (decimal 131) and 10000100 (decimal 132).
When we concatenate these 2 values into '1000001110000100' it gives the decimal value 33668; however, if we remove the leading 1 and transform '000001110000100' to decimal, it gives the correct value of 900...
Not too sure why this is, though.
Any help would be appreciated.
It looks like VFP is storing your value as a signed 16 bit (short) integer. It seems to have a strange changeover point to me for the negative numbers but it adds 128 to 8 bit numbers and adds 32768 to 16 bit numbers.
So converting your 16 bit numbers from the string should be as easy as reading it as a 16 bit integer and then taking 32768 away from it. If you have to do this manually then the first number has to be multiplied by 256 and then add the second number to get the stored value. Then take 32768 away from this number to get your value.
Examples:
131 * 256 = 33536
33536 + 132 = 33668
33668 - 32768 = 900
You could try using the C# conversions as per http://msdn.microsoft.com/en-us/library/ms131059.aspx and http://msdn.microsoft.com/en-us/library/tw38dw27.aspx to do at least some of the work for you but if not it shouldn't be too hard to code the above manually.
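A minimal sketch of that manual conversion (assuming the two characters arrive high byte first, as in the example above; the method name is mine):
// 131 and 132 -> 131 * 256 + 132 = 33668, and 33668 - 32768 = 900.
public static short VfpCharsToShort(byte high, byte low)
{
    int stored = high * 256 + low;
    return (short)(stored - 32768);
}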
It's a few years late, but here's a working example.
public ulong CharToBin(byte[] s)
{
    if (s == null || s.Length < 1 || s.Length > 8)
        return 0ul;

    var v = s.Select(c => (ulong)c).ToArray();
    var result = 0ul;
    var multiplier = 1ul;
    for (var i = 0; i < v.Length; i++)
    {
        if (i > 0)
            multiplier *= 256ul;
        result += v[i] * multiplier;
    }
    return result;
}
This is a VFP 8 and earlier equivalent for CTOBIN, which covers your scenario. You should be able to write your own BINTOC based on the code above. VFP 9 added support for multiple options like non-reversed binary data, currency and double data types, and signed values. This sample only covers reversed unsigned binary like older VFP supported.
Some notes:
The code supports 1, 2, 4, and 8-byte values, which covers all unsigned numeric values up to System.UInt64.
Before casting the result down to your expected numeric type, you should verify the ceiling. For example, if you need an Int32, then check the result against Int32.MaxValue before you perform the cast.
The sample avoids the complexity of string encoding by accepting a byte array. You would need to understand which encoding was used to read the string, then apply that same encoding to get the byte array before calling this function. In the VFP world, this is frequently Encoding.ASCII, but it depends on the application.
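For the other direction, a possible BinToChar counterpart to the CharToBin above (my own sketch of what a minimal equivalent might look like, producing the same reversed, unsigned byte order; it is not guaranteed to match all of VFP's BINTOC options):
public byte[] BinToChar(ulong value, int size)
{
    if (size < 1 || size > 8)
        return new byte[0];

    var result = new byte[size];
    for (var i = 0; i < size; i++)
    {
        result[i] = (byte)(value & 0xFF); // emit the least significant byte first
        value >>= 8;
    }
    return result;
}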

Reading and setting base 3 digits from base 2 integer

Part of my application data contains a set of 9 ternary (base-3) "bits". To keep the data compact for the database, I would like to store that data as a single short. Since 3^9 < 2^15 I can represent any possible 9 digit base-3 number as a short.
My current method is to work with it as a string of length 9. I can read or set any digit by index, and it is nice and easy. To convert it to a short though, I am currently converting to base 10 by hand (using a shift-add loop) and then using Int16.Parse to convert it back to a binary short. To convert a stored value back to the base 3 string, I run the process in reverse. All of this takes time, and I would like to optimize it if at all possible.
What I would like to do is always store the value as a short, and read and set ternary bits in place. Ideally, I would have functions to get and set individual digits from the binary in place.
I have tried playing with some bit shifts and mod functions, but haven't quite come up with the right way to do this. I'm not even sure it is possible without going through the full conversion.
Can anyone give me any bitwise arithmetic magic that can help out with this?
public class Base3Handler
{
    // Powers of 3: idx[p] = 3^p, used to isolate the base-3 digit at position p.
    private static int[] idx = {1, 3, 9, 27, 81, 243, 729, 729*3, 729*9, 729*27};

    public static byte ReadBase3Bit(short n, byte position)
    {
        if (position > 8)
            throw new ArgumentOutOfRangeException("position");
        return (byte)((n % idx[position + 1]) / idx[position]);
    }

    public static short WriteBase3Bit(short n, byte position, byte newBit)
    {
        byte oldBit = ReadBase3Bit(n, position);
        return (short)(n + (newBit - oldBit) * idx[position]);
    }
}
These are small numbers. Store them as you wish, efficiently in memory, but then use a table lookup to convert from one form to another as needed.
You can't do bit operations on ternary values. You need to use multiply, divide and modulo to extract and combine values.
To use bit operations you need to limit the packing to 8 ternaries per short (i.e. 2 bits each).
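If the table-lookup route appeals to you, a sketch might precompute the digits once at startup (3^9 = 19,683 entries, so the table is tiny); the names here are illustrative:
// For each of the 19,683 possible values, the nine base-3 digits, least significant first.
static readonly byte[][] DigitsOf = BuildTable();

static byte[][] BuildTable()
{
    var table = new byte[19683][];
    for (int n = 0; n < 19683; n++)
    {
        var digits = new byte[9];
        int rest = n;
        for (int i = 0; i < 9; i++)
        {
            digits[i] = (byte)(rest % 3);
            rest /= 3;
        }
        table[n] = digits;
    }
    return table;
}

// Reading digit 'position' of a stored short n is then just: DigitsOf[n][position]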
