I'm looking for a PRNG (pseudo randomness) that you initially seed with an arbitrary array of bytes.
Heard of any?
Hashing your arbitrary-length seed (instead of XORing it, as paxdiablo suggested) ensures that two different seeds are extremely unlikely to collide. The probability equals that of a hash collision, and an accidental collision with something such as SHA-1 or SHA-2 is a practical impossibility.
You can then use your hashed seed as the input to a decent PRNG such as my favourite, the Mersenne Twister.
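A minimal sketch of this approach in Python, whose standard `random.Random` class is itself a Mersenne Twister. The function name `prng_from_bytes` and the choice of SHA-256 are illustrative assumptions, not part of the original suggestion:

```python
import hashlib
import random


def prng_from_bytes(seed_bytes: bytes) -> random.Random:
    """Return a Mersenne Twister seeded from an arbitrary byte string."""
    # Hash the seed so that input of any length maps to a fixed 256-bit value.
    digest = hashlib.sha256(seed_bytes).digest()
    # Interpret the digest as one large integer; random.Random accepts
    # integers of any size as a seed.
    return random.Random(int.from_bytes(digest, "big"))


if __name__ == "__main__":
    rng = prng_from_bytes(b"any arbitrary array of bytes")
    print([rng.randrange(100) for _ in range(5)])
```

The same bytes always give the same sequence, and changing a single byte of the seed gives an unrelated one.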
UPDATE
The Mersenne Twister implementation available here already seems to accept an arbitrary length key: http://code.msdn.microsoft.com/MersenneTwister/Release/ProjectReleases.aspx?ReleaseId=529
UPDATE 2
For an analysis of just how unlikely a SHA2 collision is see how hard someone would have to work to find one, quoting http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-2 :
There are two meet-in-the-middle preimage attacks against SHA-2 with a reduced number of rounds. The first one attacks 41-round SHA-256 out of 64 rounds with time complexity of 2^253.5 and space complexity of 2^16, and 46-round SHA-512 out of 80 rounds with time 2^511.5 and space 2^3. The second one attacks 42-round SHA-256 with time complexity of 2^251.7 and space complexity of 2^12, and 42-round SHA-512 with time 2^502 and space 2^22.
Why don't you just XOR your arbitrary sequence into a type of the right length (padding it with part of itself if necessary)? For example, if you want the seed "paxdiablo" and your PRNG has a four-byte seed:
paxd  0x70617864
iabl  0x6961626c
opax  0x6f706178
      ----------
      0x76707b70 or 0x707b7076 (Intel-endian).
I know that seed looks artificial (and it is since the key is chosen from alpha characters). If you really wanted to make it disparate where the phrase is likely to come from a similar range, XOR it again with a differentiator like 0xdeadbeef or 0xa55a1248:
paxd  0x70617864  0x70617864
iabl  0x6961626c  0x6961626c
opax  0x6f706178  0x6f706178
      0xdeadbeef  0xa55a1248
      ----------  ----------
      0xa8ddc59f  0xd32a6938
I prefer the second one since it will more readily move similar bytes into disparate ranges (the upper bits of the bytes in the differentiator are disparate).
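The folding above can be sketched in a few lines of Python. The function name `xor_fold_seed` and its parameters are hypothetical, and the sketch assumes the chunks are read as big-endian words, as in the worked example:

```python
def xor_fold_seed(key: bytes, width: int = 4, differentiator: int = 0) -> int:
    """XOR-fold key into a width-byte integer, padding with the key itself."""
    if not key:
        raise ValueError("key must not be empty")
    # Round the length up to a multiple of width, repeating the key as padding.
    padded_len = -(-len(key) // width) * width
    padded = bytes(key[i % len(key)] for i in range(padded_len))
    seed = differentiator
    for offset in range(0, padded_len, width):
        # Each chunk is read as a big-endian word, as in the example above.
        seed ^= int.from_bytes(padded[offset:offset + width], "big")
    return seed


if __name__ == "__main__":
    print(hex(xor_fold_seed(b"paxdiablo")))
    print(hex(xor_fold_seed(b"paxdiablo", differentiator=0xA55A1248)))
```

With the key "paxdiablo" this reproduces the three results worked out by hand above.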
Could somebody help me to understand what is the most significant byte of a 160 bit (SHA-1) hash?
I have C# code that calls the cryptography library to calculate a hash code from a data stream. The result is a 20-byte C# array. I then calculate another hash code from another data stream and need to place the hash codes in ascending order.
Now I'm trying to understand how to compare them correctly. Apparently I need to subtract one from the other and check whether the result is negative, positive or zero. Technically, I have two 20-byte arrays. From the memory perspective, the least significant byte is at the beginning (lower memory address) and the most significant byte is at the end (higher memory address). From the human reading perspective, the most significant byte is at the beginning and the least significant is at the end and, if I'm not mistaken, this order is used for comparing GUIDs. Of course, the two approaches give different orderings. Which one is considered the right or conventional way to compare hash codes? This is especially important in our case because we are thinking about implementing a distributed hash table that should be compatible with existing ones.
You should think of the initial hash as just bytes, not a number. If you're trying to order them for indexed lookup, use whatever ordering is simplest to implement - there's no general purpose "right" or "conventional" here, really.
If you've got some specific hash table you want to be "compatible" with (not even sure what that would mean) you should see what approach to ordering that hash table takes, assuming it's even relevant. If you've got multiple tables you need to be compatible with, you may find you need to use different ordering for different tables.
Given the comments, you're trying to work with Kademlia, which based on this document treats the hashes as big-endian numbers:
Kademlia follows Pastry in interpreting keys (including nodeIDs) as bigendian numbers. This means that the low order byte in the byte array representing the key is the most significant byte and so if two keys are close together then the low order bytes in the distance array will be zero.
That's just an arbitrary interpretation of the bytes - so long as everyone uses the same interpretation, it will work... but it would work just as well if everyone decided to interpret them as little-endian numbers.
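To make the two interpretations concrete, here is a small Python sketch. The function names are hypothetical; it only assumes that both hashes have the same length:

```python
def compare_big_endian(a: bytes, b: bytes) -> int:
    """Return -1, 0 or 1, treating equal-length hashes as big-endian numbers."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    # For equal lengths, Python's bytes comparison is lexicographic on
    # unsigned byte values, which matches big-endian numeric order.
    return (a > b) - (a < b)


def compare_little_endian(a: bytes, b: bytes) -> int:
    """Same comparison, but the last byte is the most significant one."""
    return compare_big_endian(a[::-1], b[::-1])
```

For example, a 20-byte array ending in 0x01 sorts below one starting with 0x01 under the big-endian reading, and above it under the little-endian reading. Either is usable as long as every participant agrees on it.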
You can use SequenceEqual to compare byte arrays for equality (it does not tell you which one is larger). See the following links for details:
How to compare two arrays of bytes
Comparing two byte arrays in .NET
I'm attempting to write a method to generate an integer based on any given string. When calling this method on 2 identical strings, I need the method to generate the same exact integer both times.
I tried using .GetHashCode(); however, this is very unreliable once I move the project to another machine, as GetHashCode() returns different values for the same string.
It is also important that the collision rate be VERY low. Custom methods I have written thus far produce collisions after just a few hundred thousand records.
The hash value MUST be an integer. A string hash value (like md5) would cripple my project in terms of speed and loading overhead.
The integer hashes are being used to perform extremely rapid text searches, which I have working beautifully; however, it currently relies on .GetHashCode() and doesn't work when multiple machines get involved.
Any insight at all would be greatly appreciated.
MD5 hashing returns a byte array which could be converted to an integer:
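// Needs: using System; using System.Security.Cryptography; using System.Text;
var mystring = "abcd";
MD5 md5Hasher = MD5.Create();
var hashed = md5Hasher.ComputeHash(Encoding.UTF8.GetBytes(mystring));
// First 4 bytes of the digest, read in the machine's byte order.
var ivalue = BitConverter.ToInt32(hashed, 0);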
Of course, you are converting from a 128 bit hash to a 32 bit int, so some information is being lost which will increase the possibility of collisions. You could try adjusting the second parameter to ToInt32 to see if any specific ranges of the MD5 hash produce fewer collisions than others for your data.
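For comparison, the same idea sketched in Python with the byte order fixed explicitly, so the result does not depend on the machine. The name `stable_hash32` is an assumption for illustration:

```python
import hashlib


def stable_hash32(text: str) -> int:
    """Map a string to a signed 32-bit integer that is the same on every machine."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    # Keep only the first 4 of the 16 digest bytes; the byte order is fixed
    # explicitly instead of depending on the CPU.
    return int.from_bytes(digest[:4], "little", signed=True)


if __name__ == "__main__":
    print(stable_hash32("abcd"))
```

Fixing both the text encoding (UTF-8) and the byte order is what makes the value reproducible across machines. The truncation to 32 bits still limits how low the collision rate can be.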
If your hash code creates duplicates "after a few hundred thousand records," you have a pretty good hash code implementation.
If you do the math, you'll find that a 32-bit hash code has a 50% chance of creating a duplicate after about 77,000 records. The probability of generating a duplicate after a million records is so close to certainty as not to matter.
As a rule of thumb, the likelihood of generating a duplicate hash code is 50% when the number of records hashed is equal to the square root of the number of possible values. So with a 32-bit hash code that has 2^32 possible values, the chance of generating a duplicate is 50% after approximately 2^16 (65,536) values. The actual number is somewhat larger, closer to 77,000, but the rule of thumb gets you in the ballpark.
Another rule of thumb is that the chance of generating a duplicate is nearly 100% when the number of items hashed is four times the square root. So with a 32-bit hash code you're almost guaranteed to get a collision after only 2^18 (262,144) records hashed.
That's not going to change if you use the MD5 and convert it from 128 bits to 32 bits.
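These figures can be checked with the standard birthday-problem approximation, p ≈ 1 - exp(-n(n-1)/(2N)). The following Python sketch assumes an ideal hash that spreads values uniformly; the function name is hypothetical:

```python
import math


def collision_probability(records: int, bits: int = 32) -> float:
    """Approximate chance that at least two of `records` uniform hashes collide."""
    space = 2.0 ** bits
    # Birthday-problem approximation: p = 1 - exp(-n(n-1) / (2N)).
    return -math.expm1(-records * (records - 1) / (2.0 * space))


if __name__ == "__main__":
    for n in (65_536, 77_163, 262_144, 1_000_000):
        print(f"{n:>9,} records: {collision_probability(n):.4f}")
```

Under this approximation the chance is a little under 40% at 2^16 records, reaches 50% at roughly 77,000, and exceeds 99.9% at 2^18.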
This code maps any string to an int between 0 and 99:
int hash = "ali".ToCharArray().Sum(c => c) % 100; // requires using System.Linq;
using (MD5 md5 = MD5.Create())
{
// UTF8 rather than Encoding.Default, whose code page varies between machines.
bigInteger = new BigInteger(md5.ComputeHash(Encoding.UTF8.GetBytes(myString)));
}
BigInteger requires Org.BouncyCastle.Math
Since computers cannot pick random numbers (can they?), how is this random number actually generated? For example, in C# we say:
Random.Next()
What happens inside?
You may check out this article. According to the documentation the specific implementation used in .NET is based on Donald E. Knuth's subtractive random number generator algorithm. For more information, see D. E. Knuth. "The Art of Computer Programming, volume 2: Seminumerical Algorithms". Addison-Wesley, Reading, MA, second edition, 1981.
Since computers cannot pick random numbers (can they?)
As others have noted, "Random" is actually pseudo-random. To answer your parenthetical question: yes, computers can pick truly random numbers. Doing so is much more expensive than the simple integer arithmetic of a pseudo-random number generator, and usually not required. However, there are applications where you must have non-predictable true randomness: cryptography and online poker immediately come to mind. If either uses a predictable source of pseudo-randomness, then attackers can decrypt/forge messages much more easily, and cheaters can figure out who has what in their hands.
The .NET crypto classes have methods that give random numbers suitable for cryptography or games where money is on the line. As for how they work: the literature on crypto-strength randomness is extensive; consult any good university undergrad textbook on cryptography for details.
Specialty hardware also exists to get random bits. If you need random numbers that are drawn from atmospheric noise, see www.random.org.
Knuth covers the topic of randomness very well.
We don't really understand random well. How can something predictable be random? And yet pseudo-random sequences can appear to be perfectly random by statistical tests.
There are three categories of Random generators, amplifying on the comment above.
First, you have pseudo random number generators where if you know the current random number, it's easy to compute the next one. This makes it easy to reverse engineer other numbers if you find out a few.
Then, there are cryptographic algorithms that make this much harder. I believe they still are pseudo random sequences (contrary to what the comment above implies), but with the very important property that knowing a few numbers in the sequence does NOT make it obvious how to compute the rest. The way it works is that crypto routines tend to hash up the number, so that if one bit changes, every bit is equally likely to change as a result.
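Consider a simple modulo generator (similar to some implementations of C's rand()):
static unsigned int seed, m, a;   /* state, multiplier, increment */
unsigned int rand(void) {
    return seed = seed * m + a;   /* wraps modulo 2^32 with a 32-bit unsigned int */
}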
If m=0 and a=0, this is a lousy generator with period 1: 0, 0, 0, 0, ....
If m=1 and a=1, it's also not very random looking: 0, 1, 2, 3, 4, 5, 6, ...
But if you pick m and a to be prime numbers around 2^16, this will jump around nicely, looking very random if you are casually inspecting. But because both numbers are odd, you would see that the low bit toggles, i.e. the number is alternately odd and even. Not a great random number generator. And since there are only 2^32 values in a 32-bit number, by definition after at most 2^32 iterations you will repeat the sequence again, making it obvious that the generator is NOT random.
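The toggling low bit is easy to observe. Below is a Python sketch of the same generator with the wrap-around written out explicitly. The helper name `make_lcg` and the particular primes 65537 and 65521 are illustrative choices, not values from the original answer:

```python
def make_lcg(m: int, a: int, seed: int = 0):
    """Return a rand()-like function computing seed = seed * m + a (mod 2**32)."""
    state = seed

    def rand() -> int:
        nonlocal state
        state = (state * m + a) % 2**32
        return state

    return rand


if __name__ == "__main__":
    rand = make_lcg(65537, 65521)  # two odd primes near 2**16
    values = [rand() for _ in range(8)]
    print(values)
    print([v & 1 for v in values])  # low bit of each output
```

With an odd multiplier and an odd increment, an odd state always produces an even one and vice versa, so the printed low bits strictly alternate.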
If you think of the middle bits as nice and scrambled, while the lower ones aren't as random, then you can construct a better random number generator out of a few of these, with the various bits XORed together so that all the bits are covered well. Something like:
(rand1() >> 8) ^ rand2() ^ (rand3() >> 5) ...
Still, every number is flipping in synch, which makes this predictable. And if you get two sequential values they are correlated, so that if you plot them you will get lines on your screen. Now imagine you have rules combining the generators, so that sequential values are not the next ones.
For example
v1 = rand1() >> 8 ^ rand2() ...
v2 = rand2() >> 8 ^ rand5() ..
and imagine that the seeds don't always advance. Now you're starting to make something that's much harder to predict based on reverse engineering, and the sequence is longer.
For example, if you compute rand1() every time, but only advance the seed in rand2() every 3rd time, a generator combining them might not repeat for far longer than the period of either one.
Now imagine that you pump your (fairly predictable) modulo-type random number generator through DES or some other encryption algorithm. That will scramble up the bits.
Obviously, there are better algorithms, but this gives you an idea. Numerical Recipes has a lot of algorithms implemented in code and explained. One very good trick: generate not one but a block of random values in a table. Then use an independent random number generator to pick one of the generated numbers, generate a new one and replace it. This breaks up any correlation between adjacent pairs of numbers.
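The table trick in the last paragraph can be sketched as a Python generator. The name `shuffled_stream`, the default table size, and the use of two `random.Random` instances as the "main" and "independent" generators are all assumptions made for the example:

```python
import itertools
import random
from typing import Iterator


def shuffled_stream(source: Iterator[int], picker: Iterator[int],
                    table_size: int = 32) -> Iterator[int]:
    """Yield values from source in an order scrambled by picker.

    Both iterators are assumed to be infinite.
    """
    # Pre-fill the table with a block of generated values.
    table = [next(source) for _ in range(table_size)]
    while True:
        # The independent generator selects which slot to emit.
        slot = next(picker) % table_size
        value = table[slot]
        # Refill the slot with a fresh value from the main generator.
        table[slot] = next(source)
        yield value


if __name__ == "__main__":
    main_gen = random.Random(1)
    pick_gen = random.Random(2)
    source = iter(lambda: main_gen.getrandbits(32), None)
    picker = iter(lambda: pick_gen.getrandbits(16), None)
    print(list(itertools.islice(shuffled_stream(source, picker), 5)))
```

Every value still comes from the main generator; only the order in which values leave the table is decided by the second one, which is what breaks up the correlation between adjacent outputs.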
The third category is actual hardware-based random number generators, for example based on atmospheric noise
http://www.random.org/randomness/
This is, according to current science, truly random. Perhaps someday we will discover that it obeys some underlying rule, but currently, we cannot predict these values, and they are "truly" random as far as we are concerned.
The boost library has excellent C++ implementations of Fibonacci generators, the reigning kings of pseudo-random sequences if you want to see some source code.
I'll just add an answer to the first part of the question (the "can they?" part).
Computers can generate (well, "generate" may not be an entirely accurate word) random numbers (as in, not pseudo-random). Specifically, they can draw on environmental randomness, obtained either through specialized hardware devices (which generate randomness from noise, for example) or from environmental inputs (e.g. hard disk timings, user input event timings).
However, that has no bearing on the second question (which was how Random.Next() works).
The Random class is a pseudo-random number generator.
It is basically an extremely long but deterministic repeating sequence. The "randomness" comes from starting at different positions. Specifying where to start is done by choosing a seed for the random number generator and can for example be done by using the system time or by getting a random seed from another random source. The default Random constructor uses the system time as a seed.
The actual algorithm used to generate the sequence of numbers is documented in MSDN:
The current implementation of the Random class is based on Donald E. Knuth's subtractive random number generator algorithm. For more information, see D. E. Knuth. "The Art of Computer Programming, volume 2: Seminumerical Algorithms". Addison-Wesley, Reading, MA, second edition, 1981.
Computers use pseudorandom number generators. Essentially, they work by starting with a seed number and iterating it through an algorithm each time a new pseudorandom number is required.
The process is of course entirely deterministic, so a given seed will generate exactly the same sequence of numbers every time it is used, but the numbers generated form a statistically uniform distribution (approximately), and this is fine, since in most scenarios all you need is stochastic randomness.
The usual practice is to use the current system time as a seed, though if more security is required, "entropy" may be gathered from a physical source such as disk latency in order to generate a seed that is more difficult to predict. In this case, you'd also want to use a cryptographically strong random number generator such as this.
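Both points can be illustrated with Python's standard library, used here as a stand-in for the .NET classes discussed in this thread. The helper names are hypothetical:

```python
import random
import secrets


def sequence(seed: int, count: int = 5) -> list:
    """Return the first `count` values produced from a given seed."""
    rng = random.Random(seed)
    return [rng.randrange(1000) for _ in range(count)]


def unpredictable_seed(num_bytes: int = 16) -> int:
    """Draw a seed from the operating system's entropy source."""
    return int.from_bytes(secrets.token_bytes(num_bytes), "big")


if __name__ == "__main__":
    print(sequence(42) == sequence(42))  # same seed, same sequence
    print(sequence(unpredictable_seed()))
```

Note that an unpredictable seed alone does not make an ordinary generator cryptographically strong. Where that matters, the values themselves should come from a generator designed for it, such as Python's `secrets` module.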
I don't know many details, but what I do know is that a seed is used to generate the random numbers: an algorithm takes that seed and derives each new number from it.
If you get random numbers based on the same seed, they will be the same every time.
Given an unsigned number, what is a good (preferably fast) way of getting the number of leading 1-bits?
I need this to calculate the CIDR number from a quad-dotted IPv4 netmask.
Note: I have seen getting-the-number-of-trailing-1-bits, so I can do the table-based lookup.
Note 2: Even though I added a couple of language tags, for me it is about the algorithm, so if you have a good one in another language, please feel free to post.
Edit: on endian-ness.
I just found out that the INET_ADDR function and IN_ADDR structure store the IPv4 address in big-endian form, whereas the x86 is little-endian, and most of the processors I use are little-endian too.
So, lookup tables for this specific case are fast enough.
But thanks for the other answers: they work very nicely in the more common case.
--jeroen
If you invert all the bits, you can use a leading-one detection algorithm (equivalent to a log2 algorithm); see e.g. http://www-graphics.stanford.edu/~seander/bithacks.html. You may be able to modify some of those algorithms, and skip the inversion stage.
Unless you are specifically looking for theoretical improvements in number of steps (i.e. O(lg(n)) in number of bits), I don't think you will practically do much better than table lookup and still have an algorithm that's portable.
I'd expect to comfortably scan over 100 million IPv4 netmasks for leading ones per second using a simple table lookup on a single thread on a modern machine (excluding the work of actually getting hold of the IPv4 netmask). I'm not sure the gains from further algorithmic improvement would be worth any added complexity.
If all the bits after the leading 1s are 0, you can use the BSF assembler instruction. It returns the index of the first set (non-zero) bit, starting from the lowest bit; subtract that index from 32 to obtain the number of bits set.
Otherwise, if you have to scan from the high bit, you can use NOT to invert the bits and then use BSR to scan from the high bit down to bit zero.
This question, though completely unrelated, shows some C code for doing CTZ/CLZ (Count Trailing/Leading Zeros). If you invert (bitwise not) your input before calling one of these you can get the number of trailing/leading ones.
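The invert-then-count-leading-zeros idea from these answers can be sketched in Python, where `int.bit_length()` plays the role of the highest-set-bit scan. The function names are hypothetical, and the netmask helper assumes a well-formed dotted quad:

```python
def leading_ones(value: int, width: int = 32) -> int:
    """Count the leading 1-bits of an unsigned `width`-bit number."""
    mask = (1 << width) - 1
    # Inverting turns leading ones into leading zeros; bit_length() then
    # gives the position of the highest remaining 1-bit.
    inverted = ~value & mask
    return width - inverted.bit_length()


def cidr_prefix(netmask: str) -> int:
    """Convert a dotted-quad netmask such as '255.255.255.0' to a prefix length."""
    octets = bytes(int(part) for part in netmask.split("."))
    # The first octet is the most significant, i.e. big-endian order.
    return leading_ones(int.from_bytes(octets, "big"))


if __name__ == "__main__":
    print(cidr_prefix("255.255.240.0"))
```

Building the integer from the octets in big-endian order sidesteps the endianness issue raised in the question's edit.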
I have a little problem where I need to hash a number of about 10 digits into a number of 6 digits. The hash needs to be deterministic.
It's more important that the hash is not resource intensive.
For example, say that I have some number, x, like 123456789
I want to write a hash function that gives me a number, y, back like 987654.
I'd then like to have a function that takes the x and y as parameters, re-applies the hash on x, and checks that the result is y.
It should be difficult to compute possible input values given the hash.
My first idea of multiplying pairs of digits led to a lot of duplicate hashed values.
I have the feeling that this sort of problem has some kind of elegant solution, but I just can't think of it myself.
Can anyone help me out here? Thanks in advance :)
What you need is called "hashing".
Try CRC16.
Your problem as stated is not solvable.
You say that you want the system to be "somewhat hard to break", by which I assume you mean that it is "somewhat hard" for an attacker to take a known digest and produce from it a possible input which hashes to the given digest. Since there are only 4 billion possible inputs and only 65536 possible hashes in the system you propose, it is utterly trivial to find a message that corresponds to a given hash, no matter what the hash algorithm is. On average, the attacker will have about 65000 possible messages to choose from, and can therefore cherry-pick the message that best serves his nefarious scheme.
I would expect a "somewhat hard" problem in the hash-breaking space to require dedicating, say, a few million dollars' worth of supercomputer time to break it. Your proposal can be broken by inexperienced high school students writing Javascript programs that take a couple of minutes to write and maybe a minute to run, tops; this is not even vaguely close to "somewhat hard".
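To show how little effort such a search takes, here is a Python sketch of the exhaustive attack. It uses the CRC-CCITT routine `binascii.crc_hqx` purely as a stand-in for "any 16-bit hash"; the function names are hypothetical:

```python
import binascii


def hash16(x: int) -> int:
    """A 16-bit CRC of a 32-bit unsigned integer (stand-in for any 16-bit hash)."""
    return binascii.crc_hqx(x.to_bytes(4, "big"), 0)


def find_preimage(target: int) -> int:
    """Return some 32-bit input whose hash16 equals target, by exhaustive search."""
    for candidate in range(2**32):
        if hash16(candidate) == target:
            return candidate
    raise ValueError("no preimage found")


if __name__ == "__main__":
    digest = hash16(123456789)
    forged = find_preimage(digest)
    print(forged, hash16(forged) == digest)
```

With only 65,536 possible digests, a matching input typically turns up after tens of thousands of candidates at most, whichever 16-bit function is used.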
Why are you choosing such tiny limits on your algorithm, limits which will by their very nature make it trivial to break the hashing? And for that matter, what's the value in hashing such a tiny amount of data as a 32 bit integer?
(( X>>16) ^ (X)) & 0xFFFF
What you want to do is to try to distribute the hash values as evenly as possible over the range. Some of the built in hashing methods are fairly good at this, so you could perhaps try something like getting the hash code of the string representation, and simply throw away half of the bits:
ushort code = (ushort)value.ToString().GetHashCode();
However, it also depends on what you are going to use the hash code for. The built in hash codes are not intended to be stored permanently. The algorithms for calculating the hash codes can change with any new version of the framework, so if you store the hash codes in the database they may become useless in the future. In that case you would instead have to create the hashing algorithm yourself from scratch, or use some hashing algorithm that was designed for permanent storage.
One simple algorithm that is used for hash codes for some values in the framework is to use exclusive or to make all bits in the value matter when the hash code is smaller than the data:
byte[] b = BitConverter.GetBytes(value);
ushort code = (ushort)(BitConverter.ToUInt16(b, 0) ^ BitConverter.ToUInt16(b, 2));
or the more efficient but less obvious way to do the same:
ushort code = (ushort)((value >> 16) ^ value);
This of course has no obfuscating properties for small values, so you might want to throw in some "random" bits to make the hash code significantly different from the value:
ushort code = (ushort)(0x56D4 ^ (value >> 16) ^ value);
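The question also asked for a function that re-applies the hash to x and checks the result against y. A Python sketch of both pieces, using the XOR fold shown above, could look like this. The names `fold16` and `verify` are assumptions, and the input is assumed to be an unsigned 32-bit value:

```python
def fold16(value: int, salt: int = 0) -> int:
    """Fold a 32-bit unsigned value into 16 bits with XOR, plus an optional salt."""
    return (salt ^ (value >> 16) ^ value) & 0xFFFF


def verify(value: int, expected: int, salt: int = 0) -> bool:
    """Re-apply the hash to value and check that it matches expected."""
    return fold16(value, salt) == expected


if __name__ == "__main__":
    x = 123456789
    y = fold16(x, salt=0x56D4)
    print(y, verify(x, y, salt=0x56D4))
```

As the answer notes, without the salt small inputs pass through unchanged, and neither variant makes it hard to find an input for a given hash.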
How about just discarding the lower 16 bits or last 4 digits?
1234567890 --> 123456
Easily done by just doing an integer division by 10000.