I noticed that the hash codes I got from some objects were different depending on whether I built for x86 or x64.
Up until now I have implemented most of my own hashing functions like this:
int someIntValueA;
int someIntValueB;
const int SHORT_MASK = 0xFFFF;

public override int GetHashCode()
{
    return (someIntValueA & SHORT_MASK) + ((someIntValueB & SHORT_MASK) << 16);
}
Will storing the values in a long and getting the hashcode from that give me a wider range as well on 64-bit systems, or is this a bad idea?
public override int GetHashCode()
{
    // cast to long first, otherwise the shift by 32 is a no-op on an int
    long maybeBiggerSpectrumPossible = someIntValueA + ((long)someIntValueB << 32);
    return maybeBiggerSpectrumPossible.GetHashCode();
}
No, that will be far worse.
Suppose your int values are typically in the range of a short: between -30000 and +30000. And suppose further that most of them are near the middle, say, between 0 and 1000. That's pretty typical. With your first hash code you get all the bits of both ints into the hash code and they don't interfere with each other; the number of collisions is zero under typical conditions.
But when you do your trick with a long, you rely on what the long implementation of GetHashCode does, which is to XOR the upper 32 bits with the lower 32 bits. So your new implementation is just a slow way of writing int1 ^ int2, which in the typical scenario has almost all zero bits, and hence collisions all over the place.
The approach you suggest won't make anything any better (quite the opposite).
However…
SpookyHash, for example, is designed to work particularly quickly on 64-bit systems, because the author was thinking about what would be fast on a 64-bit system when working out the math. xxHash has 32-bit and 64-bit variants designed to give comparable hash quality at better speed for 32-bit and 64-bit computation respectively.
The general idea of exploiting the differing performance of arithmetic operations on different machines is a valid one.
And your general idea of making use of a larger intermediary storage in hash calculation is also a valid one as long as those extra bits make their way into subsequent operations.
So at a very general level, the answer is yes, even if your particular implementation fails to come through with that.
Now, in practice, when you're sitting down to write a hashcode implementation, should you worry about this?
Well, it depends. For a while I was very bullish about using algorithms like SpookyHash, and it does very well (even on 32-bit systems) when the hash is based on a large amount of source data. But on the other hand it can be better, especially when used with smaller hash-based sets and dictionaries, to be crappy really fast than fantastic slowly. So there isn't a one-size-fits-all answer. With just two input integers, your initial solution is likely to beat a high-avalanche algorithm like xxHash or SpookyHash for many uses. You could perhaps do better if you also ORed in a >> 16 so that the shift becomes a rotation (fun fact: some jitters are optimised for rotations), but that has nothing to do with 64- vs 32-bit versions at all.
The cases where you do find a big possible improvement with taking a different approach in 64- and 32-bit are where there's a large amount of data to mix in, especially if it's in a blittable form (like string or byte[]) that you can access via a long* or int* depending on framework.
So, generally you can ignore the question of bitness, but if you find yourself thinking "this hashcode has to go through so much stuff to get an answer; can I make it better?" then maybe it's time to consider such matters.
Is there a way in .NET to generate a sequence of all the 32-bit integers (Int32) in random order, without repetitions, and in a memory-efficient manner? Memory-efficient would mean using a maximum of just a few hundred megabytes of main memory.
Ideally the sequence should be something like an IEnumerable<int>, and it lazily returns the next number in sequence, only when requested.
I did some quick research and I found some partial solutions to this:
Using a maximal linear feedback shift register - if I understood correctly, it only generates the numbers in increasing sequence and it does not cover the entire range
Using the Fisher-Yates or other shuffling algorithms over collections - this would violate the memory restrictions given the large range
Maintaining a set-like collection and keep generating a random integer (perhaps using Random) until it doesn't repeat, i.e. it's not in the set - apart from possibly failing to satisfy the memory requirements, it would get ridiculously slow when generating the last numbers in the sequence.
Random permutations over 32 bits, however I can't think of a way to ensure non-repeatability.
Is there another way to look at this problem - perhaps taking advantage of the fixed range of values - that would give a solution satisfying the memory requirements? Maybe the .NET class libraries come with something useful?
UPDATE 1
Thanks everyone for your insights and creative suggestions for a solution. I'll try to implement and test soon (both for correctness and memory efficiency) the 2 or 3 most promising solutions proposed here, post the results and then pick a "winner".
UPDATE 2
I tried implementing hvd's suggestion in the comment below. I tried using both the BitArray in .NET and my custom implementation, since the .NET one is limited to int.MaxValue entries, so not enough to cover the entire range of integers.
I liked the simplicity of the idea and I was willing to "sacrifice" those 512 MB of memory if it worked fine. Unfortunately, the run time is quite slow, spending up to tens of seconds to generate the next random number on my machine, which has a 3.5 GHz Core i7 CPU. So unfortunately this is not acceptable if you request many, many random numbers to be generated. I guess it's predictable though: it's an O(M × N) algorithm if I'm not mistaken, where N is 2^32 and M is the number of requested integers, so all those iterations take their toll.
Ideally I'd like to generate the next random number in O(1) time, while still meeting the memory requirements; maybe the next algorithms suggested here are suited for this. I'll give them a try as soon as I can.
UPDATE 3
I just tested the Linear Congruential Generator and I can say I'm quite pleased with the results. It looks like a strong contender for the winner position in this thread.
Correctness: all integers generated exactly once (I used a bit vector to check this).
Randomness: fairly good.
Memory usage: excellent, just a few bytes.
Run time: generates the next random integer very fast, as you can expect from an O(1) algorithm. Generating every integer took a total of approx. 11 seconds on my machine.
All in all I'd say it's a very appropriate technique if you're not looking for highly randomized sequences.
UPDATE 4
The modular multiplicative inverse technique described below behaves quite similarly to the LCG technique - not surprising, since both are based on modular arithmetic - although I found it a bit less straightforward to implement in a way that yields satisfactorily random sequences.
One interesting difference I found is that this technique seems faster than LCG: generating the entire sequence took about 8 seconds, versus 11 seconds for LCG. Other than this, all other remarks about memory efficiency, correctness and randomness are the same.
UPDATE 5
Looks like user TomTom deleted their answer with the Mersenne Twister without notice after I pointed out in a comment that it generates duplicate numbers sooner than required. So I guess this rules out the Mersenne Twister entirely.
UPDATE 6
I tested another suggested technique that looks promising, Skip32, and while I really liked the quality of the random numbers, the algorithm is not suitable for generating the entire range of integers in an acceptable amount of time. So unfortunately it falls short when compared to the other techniques that were able to finish the process. I used the C# implementation from here, by the way - I changed the code to reduce the number of rounds to 1, but it still couldn't finish in a timely manner.
After all, judging by the results described above, my personal choice for the solution goes to the modular multiplicative inverse technique, followed very closely by the linear congruential generator. Some may argue that it is inferior in certain aspects to other techniques, but given my original constraints I'd say it fits them best.
If you don't need the random numbers to be cryptographically secure, you can use a Linear Congruential Generator.
An LCG is a formula of the form X_n+1 = (X_n * a + c) mod m; it needs constant memory and constant time for every generated number.
If proper values for the LCG are chosen, it will have a full period length, meaning it will output every number between 0 and your chosen modulus.
An LCG has a full period if and only if:
The modulus and the increment are relatively prime, i.e. GCD(m, c) = 1
a - 1 is divisible by all prime factors of m
If m is divisible by 4, a - 1 must be divisible by 4.
Our modulus is 2^32, meaning a must be a number of the form 4k + 1 where k is an arbitrary integer, and c must be odd.
While this is a C# question I've coded a small C++ program to test the speed of this solution, since I'm more comfortable in that language:
#include <iostream>
#include <cstdlib>
#include <cstdint>
#include <ctime>

class lcg {
private:
    unsigned a, c, val;
public:
    // cast rand() to unsigned before multiplying to avoid signed overflow;
    // a = 4k + 1 and an odd c guarantee the full period
    lcg(unsigned seed = 0) : lcg(seed, (unsigned)rand() * 4 + 1, (unsigned)rand() * 2 + 1) {}
    lcg(unsigned seed, unsigned a, unsigned c) {
        val = seed;
        this->a = a;
        this->c = c;
        std::cout << "Initiated LCG with seed " << seed << "; a = " << a << "; c = " << c << std::endl;
    }
    unsigned next() {
        this->val = a * this->val + c;
        return this->val;
    }
};

int main() {
    srand(time(NULL));
    unsigned seed = rand();
    int dummy = 0;
    lcg gen(seed);
    time_t t = time(NULL);
    for (uint64_t i = 0; i < 0x100000000ULL; i++) {
        if (gen.next() < 1000) dummy++; // Avoid optimizing this out with -O2
    }
    std::cout << "Finished cycling through. Took " << (time(NULL) - t) << " seconds." << std::endl;
    if (dummy > 0) return 0;
    return 1;
}
You may notice I am not using the modulo operation anywhere in the lcg class; that's because 32-bit unsigned integer overflow performs the mod 2^32 operation for us.
This produces all values in the range [0, 4294967295] inclusive.
I also had to add a dummy variable for the compiler not to optimize everything out.
With no optimization this solution finishes in about 15 seconds, while with -O2 (moderate optimization) it finishes in under 5 seconds.
If "true" randomness is not an issue, this is a very fast solution.
Is there a way in .NET
Actually, this can be done in most any language
to generate a sequence of all the 32-bit integers (Int32)
Yes.
in random order,
Here we need to agree on terminology since "random" is not what most people think it is. More on this in a moment.
without repetitions,
Yes.
and in a memory-efficient manner?
Yes.
Memory-efficient would mean using a maximum of just a few hundred megabytes of main memory.
Ok, so would using almost no memory be acceptable? ;-)
Before getting to the suggestion, we need to clear up the matter of "randomness". Something that is truly random has no discernible pattern. Hence, running the algorithm millions of times in a row could theoretically return the same value across all iterations. If you throw in the concept of "must be different from the prior iteration", then it is no longer random. However, looking at all of the requirements together, it seems that all that is really being asked for is "differing patterns of distribution of the integers". And this is doable.
So how to do this efficiently? Make use of Modular multiplicative inverses. I used this to answer the following Question which had a similar requirement to generate non-repeating, pseudo-random, sample data within certain bounds:
Generate different random time in the given interval
I first learned about this concept here (generate seemingly random unique numeric ID in SQL Server) and you can use either of the following online calculators to determine your "Integer" and "Modular Multiplicative Inverse (MMI)" values:
http://planetcalc.com/3311/
http://www.cs.princeton.edu/~dsri/modular-inversion-answer.php
Applying that concept here, you would use Int32.MaxValue as the Modulo value.
This would give a definite appearance of random distribution with no chance for collisions and no memory needed to store already used values.
The only initial problem is that the pattern of distribution is always the same given the same "Integer" and "MMI" values. So, you could come up with differing patterns by either adding a "randomly" generated Int to the starting value (as I believe I did in my answer about generating the sample data in SQL Server) or you can pre-generate several combinations of "Integer" and corresponding "MMI" values, store those in a config file / dictionary, and use a .NET random function to pick one at the start of each run. Even if you store 100 combinations, that is almost no memory use (assuming it is not in a config file). In fact, if storing both as Int and the dictionary uses Int as an index, then 1000 values is approximately 12k?
UPDATE
Notes:
There is a pattern in the results, but it is not discernible unless you have enough of them at any given moment to look at in total. For most use-cases, this is acceptable since no recipient of the values would have a large collection of them, or know that they were assigned in sequence without any gaps (and that knowledge is required in order to determine if there is a pattern).
Only 1 of the two variable values -- "Integer" and "Modular Multiplicative Inverse (MMI)" -- is needed in the formula for a particular run. Hence:
each pair gives two distinct sequences
if maintaining a set in memory, only a simple array is needed, and assuming that the array index is merely an offset in memory from the base address of the array, then the memory required should only be 4 bytes * capacity (i.e. 1024 options is only 4k, right?)
Here is some test code. It is written in T-SQL for Microsoft SQL Server since that is where I work primarily, and it also has the advantage of making it real easy-like to test for uniqueness, min and max values, etc., without needing to compile anything. The syntax will work in SQL Server 2008 or newer. For SQL Server 2005, initialization of variables had not been introduced yet, so each DECLARE that contains an = would merely need to be separated into the DECLARE by itself and a SET @Variable = ... for however that variable is being initialized. And the SET @Index += 1; would need to become SET @Index = @Index + 1;.
The test code will error if you supply values that produce any duplicates. And the final query indicates if there are any gaps since it can be inferred that if the table variable population did not error (hence no duplicates), and the total number of values is the expected number, then there could only be gaps (i.e. missing values) IF either or both of the actual MIN and MAX values are outside of the expected values.
PLEASE NOTE that this test code does not imply that any of the values are pre-generated or need to be stored. The code only stores the values in order to test for uniqueness and min / max values. In practice, all that is needed is the simple formula, and all that is needed to pass into it is:
the capacity (though that could also be hard-coded in this case)
the MMI / Integer value
the current "index"
So you only need to maintain 2 - 3 simple values.
DECLARE @TotalCapacity INT = 30; -- Modulo; -5 to +4 = 10 OR Int32.MinValue
                                 -- to Int32.MaxValue = (UInt32.MaxValue + 1)
DECLARE @MMI INT = 7; -- Modular Multiplicative Inverse (MMI) or
                      -- Integer (derived from @TotalCapacity)
DECLARE @Offset INT = 0; -- needs to stay at 0 if min and max values are hard-set
-----------
DECLARE @Index INT = (1 + @Offset); -- start

DECLARE @EnsureUnique TABLE ([OrderNum] INT NOT NULL IDENTITY(1, 1),
                             [Value] INT NOT NULL UNIQUE);
SET NOCOUNT ON;

BEGIN TRY
    WHILE (@Index < (@TotalCapacity + 1 + @Offset)) -- range + 1
    BEGIN
        INSERT INTO @EnsureUnique ([Value]) VALUES (
            ((@Index * @MMI) % @TotalCapacity) - (@TotalCapacity / 2) + @Offset
        );
        SET @Index += 1;
    END;
END TRY
BEGIN CATCH
    DECLARE @Error NVARCHAR(4000) = ERROR_MESSAGE();
    RAISERROR(@Error, 16, 1);
    RETURN;
END CATCH;

SELECT * FROM @EnsureUnique ORDER BY [OrderNum] ASC;

SELECT COUNT(*) AS [TotalValues],
       @TotalCapacity AS [ExpectedCapacity],
       MIN([Value]) AS [MinValue],
       (@TotalCapacity / -2) AS [ExpectedMinValue],
       MAX([Value]) AS [MaxValue],
       (@TotalCapacity / 2) - 1 AS [ExpectedMaxValue]
FROM @EnsureUnique;
A 32 bit PRP in CTR mode seems like the only viable approach to me (your 4th variant).
You can either
Use a dedicated 32 bit block cipher.
Skip32, the 32 bit variant of Skipjack is a popular choice.
As a tradeoff between quality/security and performance you can adjust the number of rounds to your needs. More rounds are slower but more secure.
Length-preserving-encryption (a special case of format-preserving-encryption)
FFX mode is the typical recommendation. But in its typical instantiations (e.g. using AES as underlying cipher) it'll be much slower than dedicated 32 bit block ciphers.
Note that many of these constructions have a significant flaw: they're even permutations. That means that once you have seen 2^32 - 2 outputs, you'll be able to predict the second-to-last output with certainty, instead of only 50%. I think Rogaway's AEZ paper mentions a way to fix this flaw.
I'm going to preface this answer by saying I realize that some of the other answers are infinitely more elegant, and probably fit your needs better than this one. This is most certainly a brute-force approach to this problem.
If getting something truly random* (or pseudo-random* enough for cryptographic purposes) is important, you could generate a list of all integers ahead of time, and store them all on disk in random order ahead of time. At the run time of your program, you then read those numbers from the disk.
Below is the basic outline of the algorithm I'm proposing to generate these numbers. All 32-bit integers can be stored in ~16 GiB of disk space (32 bits = 4 bytes, 4 bytes / integer * 2^32 integers = 2^34 bytes = 16 GiB, plus whatever overhead the OS/filesystem needs), and I've taken "a few hundred megabytes" to mean that you want to read in a file of no more than 256 MiB at a time.
Generate 16 GiB / 256 MiB = 64 ASCII text files with 256 MiB of "null" characters (all bits set to 0) each. Name each text file "0.txt" through "63.txt"
Loop sequentially from Int32.MinValue to Int32.MaxValue, skipping 0. This is the value of the integer you're currently storing.
On each iteration, generate a random integer from 0 to UInt32.MaxValue from the source of randomness of your choice (hardware true random generator, pseudo-random algorithm, whatever). This is the index of the value you're currently storing.
Split the index into two integers: the 6 most significant bits, and the remaining 26. Use the upper bits to load the corresponding text file.
Multiply the lower 26 bits by 4 and use that as an index in the opened file. If the four bytes following that index are all still the "null" character, encode the current value into four ASCII characters, and store those characters in that position. If they are not all the "null" character, go back to step 3.
Repeat until all integers have been stored.
This would ensure that the numbers are from a known source of randomness but are still unique, rather than having the limitations of some of the other proposed solutions. It would take a long time to "compile" (especially using the relatively naive algorithm above), but it meets the runtime efficiency requirements.
At runtime, you can now generate a random starting index, then read the bytes in the files sequentially to obtain a unique, random*, non-repeating sequence of integers. Assuming that you're using a relatively small number of integers at once, you could even index randomly into the files, storing which indices you've used and ensuring a number is not repeated that way.
(* I understand that the randomness of any source is lessened by imposing the "uniqueness" constraint, but this approach should produce numbers relatively close in randomness to the original source)
TL;DR - Shuffle the integers ahead of time, store all of them on disk in a number of smaller files, then read from the files as needed at runtime.
Nice puzzle. A few things come to mind:
We need to store which items have been used. If approximately is good enough, you might want to use a bloom filter for this. But since you specifically state that you want all numbers, there's only one data structure for this: a bit vector.
You probably want to use a pseudo random generator algorithm with a long period.
And the solution probably involves using multiple algorithms.
My first attempt was to figure out how well pseudo-random number generation works with a simple bit vector. I accept collisions (and therefore a slowdown), but definitely not too many collisions. This simple algorithm will generate about half the numbers for you in a limited amount of time.
static ulong xorshift64star(ulong x)
{
x ^= x >> 12; // a
x ^= x << 25; // b
x ^= x >> 27; // c
return x * 2685821657736338717ul;
}
static void Main(string[] args)
{
byte[] buf = new byte[512 * 1024 * 1024];
Random rnd = new Random();
ulong value = (uint)rnd.Next(int.MinValue, int.MaxValue);
long collisions = 0;
Stopwatch sw = Stopwatch.StartNew();
for (long i = 0; i < uint.MaxValue; ++i)
{
if ((i % 1000000) == 0)
{
Console.WriteLine("{0} random in {1:0.00}s (c={2})", i, sw.Elapsed.TotalSeconds, collisions - 1000000);
collisions = 0;
}
uint randomValue; // result will be stored here
bool collision;
do
{
value = xorshift64star(value);
randomValue = (uint)value;
collision = (buf[randomValue >> 3] & (1 << (int)(randomValue & 7))) != 0;
++collisions;
}
while (collision);
buf[randomValue >> 3] |= (byte)(1 << (int)(randomValue & 7));
}
Console.ReadLine();
}
After about 1.9 billion random numbers, the algorithm will start to come to a grinding halt.
1953000000 random in 283.74s (c=10005932)
[...]
2108000000 random in 430.66s (c=52837678)
So, let's for the sake of argument say that you're going to use this algorithm for the first +/- 2 billion numbers.
Next, you need a solution for the rest, which is basically the problem that the OP described. For that, I'd sample random numbers into a buffer and combine the buffer with the Knuth shuffle algorithm. You can also use this right from the start if you like.
This is what I came up with (probably still buggy so do test...):
static void Main(string[] args)
{
Random rnd = new Random();
byte[] bloom = new byte[512 * 1024 * 1024];
uint[] randomBuffer = new uint[1024 * 1024];
ulong value = (uint)rnd.Next(int.MinValue, int.MaxValue);
long collisions = 0;
Stopwatch sw = Stopwatch.StartNew();
int n = 0;
for (long i = 0; i < uint.MaxValue; i += n)
{
// Rebuild the buffer. We know that we have uint.MaxValue-i entries left and that we have a
// buffer of 1M size. Let's calculate the chance that you want any available number in your
// buffer, which is now:
double total = uint.MaxValue - i;
double prob = ((double)randomBuffer.Length) / total;
if (i >= uint.MaxValue - randomBuffer.Length)
{
prob = 1; // always a match.
}
uint threshold = (uint)(prob * uint.MaxValue);
n = 0;
for (long j = 0; j < uint.MaxValue && n < randomBuffer.Length; ++j)
{
// is it available? Let's shift so we get '0' (unavailable) or '1' (available)
int available = 1 ^ ((bloom[j >> 3] >> (int)(j & 7)) & 1);
// use the xorshift algorithm to generate a random value:
value = xorshift64star(value);
// roll a die for this number. If we match the probability check, add it.
if (((uint)value) <= threshold * available)
{
// Store this in the buffer
randomBuffer[n++] = (uint)j;
// Ensure we don't encounter this thing again in the future
bloom[j >> 3] |= (byte)(1 << (int)(j & 7));
}
}
// Our buffer now has N random values, ready to be emitted. However, it's
// still sorted, which is something we don't want.
for (int j = 0; j < n; ++j)
{
// Grab index to swap. We can do this with Xorshift, but I didn't bother.
int index = rnd.Next(j, n);
// Swap
var tmp = randomBuffer[j];
randomBuffer[j] = randomBuffer[index];
randomBuffer[index] = tmp;
}
for (int j = 0; j < n; ++j)
{
uint randomNumber = randomBuffer[j];
// Do something with random number buffer[i]
}
Console.WriteLine("{0} random in {1:0.00}s", i, sw.Elapsed.TotalSeconds);
}
Console.ReadLine();
}
Back to the requirements:
Is there a way in .NET to generate a sequence of all the 32-bit integers (Int32) in random order, without repetitions, and in a memory-efficient manner? Memory-efficient would mean using a maximum of just a few hundred megabytes of main memory.
Cost: 512 MB + 4 MB.
Repetitions: none.
It's pretty fast. It just isn't 'uniformly' fast. Every 1 million numbers, you have to recalculate the buffer.
What's also nice: both algorithms can work together, so you can first generate the first -say- 2 billion numbers very fast, and then use the second algorithm for the rest.
One of the easiest solutions is to use a block encryption algorithm like AES in counter mode. You need a seed, which serves as the AES key. Next you need a counter which is incremented for each new random value. The random value is the result of encrypting the counter with the key. Since the mapping between cleartext (counter) and random number (ciphertext) is bijective, by the pigeonhole principle the random numbers are unique (within one block size).
Memory efficiency: you only need to store the seed and the counter.
The only limitation is that AES has a 128-bit block size instead of your 32 bits. So you might need to increase the range to 128 bits, or find a block cipher with a 32-bit block size.
For your IEnumerable you can write a wrapper. The index is the counter.
Disclaimer: you are asking for non-repeating/unique numbers. This disqualifies the output from being truly random, because normally you should see collisions in random numbers. Therefore you should not use it for a long sequence. See also https://crypto.stackexchange.com/questions/25759/how-can-a-block-cipher-in-counter-mode-be-a-reasonable-prng-when-its-a-prp
As your numbers are, per your definition, supposed to be random, there is by definition no other way than to store all of them, as the numbers have no intrinsic relation to each other.
So this would mean that you have to store all values you used in order to prevent them from being used again.
However, in computing the pattern just needs to be not "noticeable". Usually the system calculates a random number by performing multiplication operations with huge predetermined values and timer values in such a way that they overflow and thus appear randomly selected. So either you use your third option, or you have to think about generating these pseudo-random numbers in a way that lets you reproduce the sequence of every number generated and check whether something is reoccurring. This obviously would be extremely computationally expensive, but you asked for memory efficiency.
So you could store the number you seeded your random generator with and the number of elements you have generated. Each time you need a new number, reseed the generator and iterate through the number of elements you generated + 1. This is your new number. Now reseed and iterate through the sequence again to check if it occurred before.
So something like this:
int seed = 123;
Int64 counter = 0;
Random rnd = new Random(seed);

int GetUniqueRandom()
{
    int newNumber = rnd.Next();
    Random rndCheck = new Random(seed);
    counter++;
    // replay only the counter - 1 previous numbers; the counter-th number
    // in the replayed sequence is newNumber itself and would always match
    for (Int64 j = 0; j < counter - 1; j++)
    {
        int checkNumber = rndCheck.Next();
        if (checkNumber == newNumber)
            return GetUniqueRandom();
    }
    return newNumber;
}
EDIT: It was pointed out that counter will reach a huge value and there's no telling if it will overflow before you got all of the 4 billion values or not.
You could try this homebrew block-cipher:
public static uint Random(uint[] seed, uint m)
{
    for (int i = 0; i < seed.Length; i++)
    {
        m *= 0x6a09e667;
        m ^= seed[i];
        m += m << 16;
        m ^= m >> 16;
    }
    return m;
}
Test code:
const int seedSize = 3; // larger values result in higher quality but are slower
var seed = new uint[seedSize];
var seedBytes = new byte[4 * seed.Length];
new RNGCryptoServiceProvider().GetBytes(seedBytes);
Buffer.BlockCopy(seedBytes, 0, seed, 0, seedBytes.Length);

for (uint i = 0; i < uint.MaxValue; i++)
{
    Random(seed, i);
}
I haven't checked the quality of its outputs yet. Runs in 19 sec on my computer for seedSize = 3.
I've recently been looking into how BitConverter works, and from reading other SO questions I've read that it takes a 'shortcut' when the start index is divisible by the size of the type being converted to, where it can just cast a pointer to the byte at the index into a pointer to the type being converted to and dereference it.
Source for ToInt16 as an example:
public static unsafe short ToInt16(byte[] value, int startIndex) {
    if (value == null) {
        ThrowHelper.ThrowArgumentNullException(ExceptionArgument.value);
    }
    if ((uint)startIndex >= value.Length) {
        ThrowHelper.ThrowArgumentOutOfRangeException(ExceptionArgument.startIndex, ExceptionResource.ArgumentOutOfRange_Index);
    }
    if (startIndex > value.Length - 2) {
        ThrowHelper.ThrowArgumentException(ExceptionResource.Arg_ArrayPlusOffTooSmall);
    }
    Contract.EndContractBlock();

    fixed (byte* pbyte = &value[startIndex]) {
        if (startIndex % 2 == 0) { // data is aligned
            return *((short*)pbyte);
        }
        else {
            if (IsLittleEndian) {
                return (short)((*pbyte) | (*(pbyte + 1) << 8));
            }
            else {
                return (short)((*pbyte << 8) | (*(pbyte + 1)));
            }
        }
    }
}
My question is why does this work regardless of the endianness of the machine, and why doesn't it use the same mechanism when the data is not aligned?
An example to clarify:
I have some bytes in buffer that I know are in Big endian format, and I want to read a short value from the array at say, index 5. I also assume that my machine, since it is Windows, uses little endian.
I would use BitConverter like so, by switching the order of my bytes to little endian:
BitConverter.ToInt16(new byte[] { buffer[6], buffer[5] }, 0)
assuming the code takes the shortcut it would do what I want: just cast the bytes as they are in the order provided and return the value. But if it didn't have that shortcut code, wouldn't it then reverse the byte order again and give me the wrong value? Or if I instead did:
BitConverter.ToInt16(new byte[] { 0, buffer[6], buffer[5] }, 1)
wouldn't it give me the wrong value since the index is not divisible by 2?
Another situation:
Say I had an array of bytes that contained a short I want to extract, already in little-endian format, but starting at an odd offset. Wouldn't the call to BitConverter reverse the order of the bytes since BitConverter.IsLittleEndian is true and the index is not aligned, thus giving me an incorrect value?
The code avoids a hardware exception on processors that don't allow misaligned data access: a bus error. That is very expensive; it is usually resolved by kernel code that splits up the bus accesses and glues the bytes together. Such processors were still pretty common around the time this code was written, the tail end of the popularity of RISC designs like MIPS. Older ARM cores and Itanium are other examples; .NET versions have been released for all of them.
It makes little difference on processors that don't have a problem with it, like Intel/AMD cores. Memory is slow.
The code uses IsLittleEndian simply because it is indexing the individual bytes. Which of course makes the byte order matter.
On most architectures there is a performance hit in accessing data that isn't aligned at the proper boundary. On x86 the CPU will allow you to read from an unaligned address, but there will be a performance hit. On some architectures you'll get a CPU fault that the operating system will trap.
I'd guess that the cost of letting the CPU fix up the reading of unaligned data is greater than the cost of reading the individual bytes and doing the shift/or operations. Also, the code is now portable to platforms where an unaligned read will cause a fault.
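As a point of comparison, a read that never depends on BitConverter's endianness logic can be written with plain shifts; this is a minimal sketch (the buffer contents are invented for illustration):

```csharp
using System;

class BigEndianReadDemo
{
    static void Main()
    {
        // A buffer holding a big-endian Int16 (0x1234) starting at index 5
        byte[] buffer = { 0, 0, 0, 0, 0, 0x12, 0x34 };

        // Assemble the value most-significant byte first;
        // host endianness and alignment are both irrelevant here
        short value = (short)((buffer[5] << 8) | buffer[6]);

        Console.WriteLine(value); // 4660 (0x1234)
    }
}
```

Because the shift/or form addresses each byte explicitly, it also works at odd offsets without any alignment concern.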
Why does this work regardless of the endianness of the machine?
The method reinterprets the bytes on the assumption that they were produced in an environment with the same endianness. In other words, endianness influences both the order the input bytes are arranged in the array and the order the bytes need to be arranged in the output short, in the same way.
Why doesn't it use the same mechanism when the machine is Big Endian?
This is an excellent observation, and it is not immediately obvious why the authors didn't do the cast. I think the reason is that if you cast pbyte (the internal byte* pointer) holding an odd address to short*, the subsequent short access would be unaligned, which requires a special opcode to avoid the hard exception some platforms generate on unaligned access.
Okay, this problem is indeed a challenge!
Background
I am working on an arithmetic-based project involving larger-than-normal numbers. I knew I was going to be working with a worst-case scenario of 4 GB-capped file sizes (I was hoping to even extend that to a 5 GB cap, as I have seen file sizes greater than 4 GB before - specifically *.iso image files).
The Question In General
Now, the algorithm(s) I will apply do not matter at the moment, but the loading and handling of such large quantities of data - the numbers - do.
A System.IO.File.ReadAllBytes(String) call can only read at most 2 GB of file data, so this is my first problem - how will I go about loading and/or accessing file sizes twice that, if not more?
Next, I was writing my own class to treat a 'stream' or array of bytes as a big number, adding operator overloads to perform hex arithmetic, until I read about the System.Numerics.BigInteger class online. But given that there is no BigInteger.MaxValue and that I can only load a maximum of 2 GB of data at a time, I don't know what the potential of BigInteger would be - even compared to the class I was writing, called Number (which does have my desired minimum potential). There were also issues with available memory and performance, though I do not care so much about speed as about completing this experimental process successfully.
Summary
How should I load 4-5 gigabytes of data?
How should I store and handle the data after having been loaded? Stick with BigInteger or finish my own Number class?
How should I handle such large quantities of memory during runtime without running out of memory? I'll be treating the 4-5 GB of data like any other number instead of an array of bytes - performing such arithmetic as division and multiplication.
PS: I cannot reveal too much information about this project under a non-disclosure agreement. ;)
For those who would like to see a sample operator from my Number object for a per-byte array adder (C#):
public static Number operator +(Number n1, Number n2)
{
// GB5_ARRAY is a cap constant for 5 GB - 5368709120L
byte[] data = new byte[GB5_ARRAY];
byte rem = 0x00, bA, bB, rm, dt;
// Iterate through all bytes until the second to last
// The last byte is the remainder if any
// I tested this algorithm on smaller arrays provided by the `BitConverter` class,
// then I made a few tweaks to satisfy the larger arrays and the Number object
for (long iDx = 0; iDx <= GB5_ARRAY-1; iDx++)
{
// bData is a byte[] with GB5_ARRAY number of bytes
// Perform a check - solves for unequal (or jagged) arrays
if (iDx < GB5_ARRAY - 1) { bA = n1.bData[iDx]; bB = n2.bData[iDx]; } else { bA = 0x00; bB = 0x00; }
Add(bA, bB, rem, out dt, out rm);
// set data and prepare for the next interval
rem = rm; data[iDx] = dt;
}
return new Number(data);
}
private static void Add(byte a, byte b, byte r, out byte result, out byte remainder)
{
int i = a + b + r;
result = (byte)(i % 256); // find the byte amount through modulus arithmetic
remainder = (byte)((i - result) / 256); // find remainder
}
Normally, you would process large files using a streaming API, either raw binary (Stream), or via some protocol-reader (XmlReader, StreamReader, etc). This also could be done via memory-mapped files in some cases. The key point here is that you only look at a small portion of the file at a time (a moderately-sized buffer of data, a logical "line", or "node", etc - depending on the scenario).
Where this gets odd is your desire to map this somehow directly to some form of large number. Frankly, I don't know how we can help with that without more information, but if you are dealing with an actual number of this size, I think you're going to struggle unless the binary protocol makes that convenient. And "performing such arithmetic as division and multiplication" is meaningless on raw data; that only makes sense on parsed data with custom operations defined.
Also: note that in .NET 4.5 you can flip a configuration switch to expand the maximum size of arrays, going over the 2GB limit. It still has a limit, but: it is a bit bigger. Unfortunately, the maximum number of elements is still the same, so if you are using a byte[] array it won't help. But if you are using SomeCompositeStruct[] you should be able to get higher usage. See gcAllowVeryLargeObjects
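For reference, that switch lives in the application configuration file; a minimal fragment (assuming .NET 4.5+ running as a 64-bit process) looks like:

```xml
<!-- app.config: lift the 2 GB per-object size limit -->
<!-- (total object size may exceed 2 GB, but the element-count cap remains) -->
<configuration>
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>
```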
Use FileStream: http://msdn.microsoft.com/en-us/library/system.io.filestream.aspx
FileStream is the beginning for you.
If you don't have enough memory (you need at least 4x the maximum size of your numbers, I think) you will need to use the hard disk. So instead of keeping all the data in memory, you would load part of it, do some computing, and write it back to the hard disk.
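A sketch of that chunked pattern (the file is generated here so the example is self-contained; in the real scenario it would be the multi-gigabyte input, and the arithmetic step is left as a comment):

```csharp
using System;
using System.IO;

class ChunkedReadDemo
{
    const int ChunkSize = 64 * 1024; // 64 KB working buffer

    static void Main()
    {
        // Create a small sample file so the sketch runs as-is
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[200000]);

        long total = 0;
        byte[] chunk = new byte[ChunkSize];
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int read;
            while ((read = fs.Read(chunk, 0, chunk.Length)) > 0)
            {
                // do the arithmetic on these `read` bytes here,
                // writing intermediate results back to disk if needed
                total += read;
            }
        }
        File.Delete(path);

        Console.WriteLine(total); // 200000
    }
}
```

Only `ChunkSize` bytes are ever resident at once, so the same loop works for a 5 GB file.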
Just to avoid reinventing the wheel, I am asking here...
I have an application with lots of arrays, and it is running out of memory.
So the thought is to compress the List<int> to something else that would have the same interface (IList<T>, for example), but use integers shorter than 32 bits.
For example, if my value range is 0 - 1,000,000 I need only log2(1,000,000) ≈ 20 bits. So instead of storing 32 bits, I can trim the excess and reduce memory requirements by 12/32 = 37.5%.
Do you know of an implementation of such array. c++ and java would be also OK, since I could easily convert them to c#.
Additional requirements (since everyone is trying to talk me OUT of the idea):
integers in the list ARE unique
they have no special property, so they aren't compressible in any other way than by reducing the bit count
if the value range is one million for example, lists would be from 2 to 1000 elements in size, but there will be plenty of them, so no BitSets
new data container should behave like re-sizable array (regarding method O()-ness)
EDIT:
Please don't tell me NOT to do it. The requirement for this is well thought-over, and it is the ONLY option that is left.
Also, 1M of value range and 20 bit for it is ONLY AN EXAMPLE. I have cases with all different ranges and integer sizes.
Also, I could have even shorter integers, for example 7-bit integers; then the packing would be
00000001
11111122
22222333
33334444
444.....
for the first five elements, packed into 5 bytes.
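That packing scheme can be sketched as follows: fixed-width unsigned values stored MSB-first into a byte buffer, bit by bit (for clarity, not speed; the class and member names are my own invention):

```csharp
using System;

class PackedIntArray
{
    // Fixed-width unsigned values (bitsPerValue <= 32) packed MSB-first into bytes
    readonly byte[] data;
    readonly int bits;

    public PackedIntArray(int count, int bitsPerValue)
    {
        bits = bitsPerValue;
        data = new byte[((long)count * bitsPerValue + 7) / 8];
    }

    public int ByteLength => data.Length;

    public void Set(int index, uint value)
    {
        long bitPos = (long)index * bits;
        for (int i = 0; i < bits; i++, bitPos++)
        {
            int b = (int)(bitPos >> 3);          // which byte
            int off = 7 - (int)(bitPos & 7);     // which bit within it, MSB-first
            if (((value >> (bits - 1 - i)) & 1) != 0)
                data[b] |= (byte)(1 << off);
            else
                data[b] &= (byte)~(1 << off);
        }
    }

    public uint Get(int index)
    {
        long bitPos = (long)index * bits;
        uint result = 0;
        for (int i = 0; i < bits; i++, bitPos++)
        {
            int b = (int)(bitPos >> 3);
            int off = 7 - (int)(bitPos & 7);
            result = (result << 1) | (uint)((data[b] >> off) & 1);
        }
        return result;
    }

    static void Main()
    {
        var packed = new PackedIntArray(1000, 20); // 1000 values at 20 bits each
        packed.Set(0, 999983);
        packed.Set(1, 7);
        Console.WriteLine(packed.Get(0));     // 999983
        Console.WriteLine(packed.Get(1));     // 7
        Console.WriteLine(packed.ByteLength); // 2500 (vs 4000 for int[1000])
    }
}
```

A production version would move whole bytes at a time rather than single bits, but the storage math is the same: 1000 twenty-bit values fit in 2500 bytes, the promised 37.5% saving over `int[]`.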
Almost done coding it - will be posted soon...
Since you can only allocate memory in byte quanta, you are essentially asking if/how you can fit the integers in 3 bytes instead of 4 (but see point 3 below). This is not a good idea.

1. Since there is no 3-byte integer type, you would need to use something else (e.g. an opaque 3-byte buffer) in its place. This would require wrapping all access to the contents of the list in code that performs the conversion, so that you can still put ints in and pull ints out.

2. Depending on both the architecture and the memory allocator, requesting 3-byte chunks might not affect the memory footprint of your program at all (it might simply litter your heap with unusable 1-byte "holes").

3. Reimplementing the list from scratch to work with an opaque byte array as its backing store would avoid the two previous issues (and it can also let you squeeze out every last bit of memory instead of just whole bytes), but it's a tall order and quite prone to error.
You might want instead to try something like:

1. Not keeping all this data in memory at the same time. At 4 bytes per int, you'd need hundreds of millions of integers before memory runs out. Why do you need all of them at once?

2. Compressing the dataset by not storing duplicates, if possible. There are bound to be a few if you are up to hundreds of millions.

3. Changing your data structure so that it stores differences between successive values (deltas), if that is possible. This might not be very hard to achieve, but you can only realistically expect something in the ballpark of a 50% improvement (which may not be enough), and it will totally destroy your ability to index into the list in constant time.
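The delta idea can be sketched like this, assuming the list is kept sorted so the deltas stay small and non-negative (my assumption; the sample values are invented):

```csharp
using System;

class DeltaDemo
{
    static void Main()
    {
        // Sorted unique values: deltas are all >= 1 and typically small,
        // so they need fewer bits than the absolute values do
        int[] values = { 100, 250, 251, 400 };
        int[] deltas = new int[values.Length];
        int prev = 0;
        for (int i = 0; i < values.Length; i++)
        {
            deltas[i] = values[i] - prev;
            prev = values[i];
        }
        // deltas: 100, 150, 1, 149

        // Reconstruction is a prefix sum - which is exactly why
        // random access to element i now costs O(i), not O(1)
        int acc = 0;
        foreach (int d in deltas) acc += d;
        Console.WriteLine(acc); // 400 (the last original value)
    }
}
```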
One option that will get you from 32 bits to 24 bits is to create a custom struct that stores an integer inside 3 bytes:
public struct Entry {
byte b1; // low
byte b2; // middle
byte b3; // high
public void Set(int x) {
b1 = (byte)x;
b2 = (byte)(x >> 8);
b3 = (byte)(x >> 16);
}
public int Get() {
return (b3 << 16) | (b2 << 8) | b1;
}
}
You can then just create a List<Entry>.
var list = new List<Entry>();
var e = new Entry();
e.Set(12312);
list.Add(e);
Console.WriteLine(list[0].Get()); // outputs 12312
This reminds me of base64 and similar kinds of binary-to-text encoding.
They take 8 bit bytes and do a bunch of bit-fiddling to pack them into 4-, 5-, or 6-bit printable characters.
This also reminds me of the Zork Standard Code for Information Interchange (ZSCII), which packs 3 letters into 2 bytes, where each letter occupies 5 bits.
It sounds like you want to take a bunch of 10- or 20-bit integers and pack them into a buffer of 8-bit bytes.
The source code is available for many libraries that handle a packed array of single bits.
Perhaps you could
(a) download that source code and modify the source (starting from some BitArray or other packed encoding), recompiling to create a new library that handles packing and unpacking 10- or 20-bit integers rather than single bits.
It may take less programming and testing time to
(b) write a library that, from the outside, appears to act just like (a), but internally it breaks up 20-bit integers into 20 separate bits, then stores them using an (unmodified) BitArray class.
Edit: Given that your integers are unique you could do the following: store unique integers until the number of integers you're storing is half the maximum number. Then switch to storing the integers you don't have. This will reduce the storage space by 50%.
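A toy sketch of that complement trick (the range and values are invented for illustration):

```csharp
using System;
using System.Collections.Generic;

class ComplementDemo
{
    static void Main()
    {
        // Toy range 0..9: suppose we "have" 9 of the 10 possible values.
        // Storing the single missing value is far cheaper than storing all 9.
        var missing = new HashSet<int> { 7 };

        // Membership flips: x is present exactly when it is NOT in `missing`
        bool Contains(int x) => !missing.Contains(x);

        Console.WriteLine(Contains(3)); // True
        Console.WriteLine(Contains(7)); // False
    }
}
```

Since the stored set is always the smaller of "present" and "absent", it never exceeds half the range.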
Might be worth exploring other simplification techniques before trying to use 20-bit ints.
How do you treat duplicate integers? If you have lots of duplicates you could reduce the storage size by storing the integers in a Dictionary<int, int> where keys are unique integers and values are corresponding counts. Note this assumes you don't care about the order of your integers.
Are your integers all unique? Perhaps you're storing lots of unique integers in the range 0 to 100 mil. In this case you could try storing the integers you don't have. Then when determining if you have an integer i just ask if it's not in your collection.