Given an unsigned number, what is a good (preferably fast) way of getting the number of leading 1-bits?
I need this to calculate the CIDR number from a quad-dotted IPv4 netmask.
Note: I have seen getting-the-number-of-trailing-1-bits, so I can do the table-based lookup.
Note 2: Even though I added a couple of language tags, for me it is about the algorithm, so if you have a good one in another language, please feel free to post.
Edit: on endianness.
I just found out that the INET_ADDR function and IN_ADDR structure store the IPv4 address in big-endian form, whereas the x86 is little-endian, and most of the processors I use are little-endian too.
So, lookup tables for this specific case are fast enough.
But thanks for the other answers: they work very nicely in the more common case.
--jeroen
If you invert all the bits, you can use a leading-one detection algorithm (equivalent to a log2 algorithm); see e.g. http://www-graphics.stanford.edu/~seander/bithacks.html. You may be able to modify some of those algorithms, and skip the inversion stage.
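For instance, here is one way to realise that idea in C# (a hedged sketch in the shift-and-test style of those bit tricks; the structure and names are my own):

static int LeadingOnesViaFold(uint mask)
{
    uint x = ~mask;             // leading 1s become leading 0s
    if (x == 0) return 32;      // mask was all ones
    int log2 = 0;               // locate the highest set bit of x
    if ((x & 0xFFFF0000u) != 0) { x >>= 16; log2 += 16; }
    if ((x & 0x0000FF00u) != 0) { x >>= 8;  log2 += 8;  }
    if ((x & 0x000000F0u) != 0) { x >>= 4;  log2 += 4;  }
    if ((x & 0x0000000Cu) != 0) { x >>= 2;  log2 += 2;  }
    if ((x & 0x00000002u) != 0) { log2 += 1; }
    return 31 - log2;           // leading zeros of x == leading ones of mask
}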
Unless you are specifically looking for theoretical improvements in the number of steps (i.e. O(lg n) in the number of bits), I don't think you will in practice do much better than a table lookup and still have an algorithm that's portable.
I'd expect to comfortably scan over 100 million IPv4 netmasks for leading ones per second using a simple table lookup on a single thread on a modern machine (excluding the work of actually getting hold of the IPv4 netmask). I'm not sure the gains from further algorithmic improvement would be worth any added complexity.
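As a rough illustration of how cheap that lookup is (a hedged sketch; the table layout and names are my own), processing the netmask one byte at a time:

// 256-entry table: number of leading 1-bits in each possible byte value
static readonly byte[] LeadingOnesPerByte = BuildTable();

static byte[] BuildTable()
{
    var table = new byte[256];
    for (int value = 0; value < 256; value++)
    {
        int count = 0;
        while (count < 8 && (value & (0x80 >> count)) != 0) count++;
        table[value] = (byte)count;
    }
    return table;
}

static int CountLeadingOnes(uint mask)
{
    int total = 0;
    for (int shift = 24; shift >= 0; shift -= 8)
    {
        int b = (int)((mask >> shift) & 0xFF);
        total += LeadingOnesPerByte[b];
        if (b != 0xFF) break;   // the run of leading 1s ends inside this byte
    }
    return total;
}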
If the value is a run of 1-bits followed only by 0-bits (as a well-formed netmask is), you can use the BSF assembler instruction: it returns the index of the lowest set (non-zero) bit, and subtracting that from 32 gives the number of leading 1-bits.
Otherwise, if you have to scan from the high bit, you can use NOT to invert the bits and then BSR to scan down from the high bit: it returns the index of the highest bit that was originally zero, and 31 minus that index is the count of leading 1-bits.
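On modern .NET you don't need inline assembly for this. As a hedged sketch, System.Numerics.BitOperations (available since .NET Core 3.0) exposes the same hardware support, compiling down to LZCNT/BSR where available:

using System.Numerics;

// The leading 1-bits of the mask are the leading 0-bits of its complement.
// BitOperations.LeadingZeroCount(0) is defined to return 32, so an
// all-ones mask (0xFFFFFFFF) correctly yields 32.
static int LeadingOnes(uint mask) => BitOperations.LeadingZeroCount(~mask);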
This question, though completely unrelated, shows some C code for doing CTZ/CLZ (Count Leading/Trailing Zeros). If you invert (bitwise not) your input before calling one of these you can get the number of leading/trailing ones.
I noticed that the hash codes I got from other objects were different when I built for either x86 or x64.
Up until now I have implemented most of my own hashing functions like this:
int someIntValueA;
int someIntValueB;
const int SHORT_MASK = 0xFFFF;

public override int GetHashCode()
{
    return (someIntValueA & SHORT_MASK) + ((someIntValueB & SHORT_MASK) << 16);
}
Will storing the values in a long and getting the hashcode from that give me a wider range as well on 64-bit systems, or is this a bad idea?
public override int GetHashCode()
{
    // note the cast to long: shifting a 32-bit int by 32 is actually a shift by 0 in C#
    long maybeBiggerSpectrumPossible = someIntValueA + ((long)someIntValueB << 32);
    return maybeBiggerSpectrumPossible.GetHashCode();
}
No, that will be far worse.
Suppose your int values are typically in the range of a short: between -30000 and +30000. And suppose further that most of them are near the middle, say, between 0 and 1000. That's pretty typical. With your first hash code you get all the bits of both ints into the hash code and they don't interfere with each other; the number of collisions is zero under typical conditions.
But when you do your trick with a long, then you rely on what the long implementation of GetHashCode does, which is to XOR the upper 32 bits with the lower 32 bits. So your new implementation is just a slow way of writing int1 ^ int2, which, in the typical scenario, has almost all zero bits, and hence collisions all over the place.
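To make that concrete, here is a hedged illustration (relying on long.GetHashCode folding the two 32-bit halves together with XOR, as described above):

// Packing (a, b) into a long and hashing it reduces to a ^ b,
// so swapped pairs collide:
long packed1 = ((long)3 << 32) | 5;       // b = 3, a = 5
long packed2 = ((long)5 << 32) | 3;       // b = 5, a = 3
Console.WriteLine(packed1.GetHashCode()); // 6 (i.e. 3 ^ 5)
Console.WriteLine(packed2.GetHashCode()); // 6 again: a collision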
The approach you suggest won't make anything any better (quite the opposite).
However…
SpookyHash, for example, is designed to work particularly quickly on 64-bit systems, because when working out the math the author was thinking about what would be fast on a 64-bit system. xxHash has 32-bit and 64-bit variants that are designed to give comparable hash quality at better speed for 32-bit and 64-bit computation respectively.
The general idea of exploiting the differing performance of arithmetic operations on different machines is a valid one.
And your general idea of making use of a larger intermediary storage in hash calculation is also a valid one as long as those extra bits make their way into subsequent operations.
So at a very general level, the answer is yes, even if your particular implementation fails to come through with that.
Now, in practice, when you're sitting down to write a hashcode implementation should you worry about this?
Well, it depends. For a while I was very bullish on algorithms like SpookyHash, and it does very well (even on 32-bit systems) when the hash is based on a large amount of source data. On the other hand, especially with smaller hash-based sets and dictionaries, it can be better to be mediocre really fast than fantastic slowly. So there isn't a one-size-fits-all answer. With just two input integers, your initial solution is likely to beat a strong-avalanche algorithm like xxHash or SpookyHash for many uses. You could perhaps do better if you also added a >> 16 so that you rotate rather than shift (fun fact: some jitters are optimised for that pattern), but that doesn't touch on 64- vs 32-bit versions at all.
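As a hedged sketch of that rotate idea (my own arrangement, not a canonical implementation):

public override int GetHashCode()
{
    // Rotate the second value by 16 bits instead of shifting it, so its
    // upper 16 bits are moved down rather than thrown away.
    uint b = (uint)someIntValueB;
    uint rotated = (b << 16) | (b >> 16); // some jitters emit a single ROL for this
    return someIntValueA ^ (int)rotated;
}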
The cases where you do find a big possible improvement with taking a different approach in 64- and 32-bit are where there's a large amount of data to mix in, especially if it's in a blittable form (like string or byte[]) that you can access via a long* or int* depending on framework.
So, generally you can ignore the question of bitness, but if you find yourself thinking "this hashcode has to go through so much stuff to get an answer; can I make it better?" then maybe it's time to consider such matters.
Could somebody help me understand what the most significant byte of a 160-bit (SHA-1) hash is?
I have C# code which calls the cryptography library to calculate a hash code from a data stream. In the result I get a 20-byte array. Then I calculate another hash code from another data stream, and then I need to place the hash codes in ascending order.
Now I'm trying to understand how to compare them correctly. Apparently I need to subtract one from the other and then check whether the result is negative, positive, or zero. Technically, I have two 20-byte arrays. Looked at from the memory perspective, the least significant byte would be at the beginning (lower memory address) and the most significant byte at the end (higher memory address). Looked at from the human-reading perspective, on the other hand, the most significant byte is at the beginning and the least significant at the end, and if I'm not mistaken this is the order used for comparing GUIDs. Of course, the two approaches will give different orderings. Which way is considered the right or conventional one for comparing hash codes? It is especially important in our case because we are thinking about implementing a distributed hash table which should be compatible with existing ones.
You should think of the initial hash as just bytes, not a number. If you're trying to order them for indexed lookup, use whatever ordering is simplest to implement - there's no general purpose "right" or "conventional" here, really.
If you've got some specific hash table you want to be "compatible" with (not even sure what that would mean) you should see what approach to ordering that hash table takes, assuming it's even relevant. If you've got multiple tables you need to be compatible with, you may find you need to use different ordering for different tables.
Given the comments, you're trying to work with Kademlia, which, based on this document, treats the hashes as big-endian numbers:
Kademlia follows Pastry in interpreting keys (including nodeIDs) as bigendian numbers. This means that the low order byte in the byte array representing the key is the most significant byte and so if two keys are close together then the low order bytes in the distance array will be zero.
That's just an arbitrary interpretation of the bytes - so long as everyone uses the same interpretation, it will work... but it would work just as well if everyone decided to interpret them as little-endian numbers.
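If you do settle on the big-endian interpretation, the comparison is just a lexicographic byte-by-byte scan; a hedged sketch (the helper name is mine):

// Compare two equal-length hashes as unsigned big-endian numbers:
// the byte at index 0 is the most significant.
static int CompareBigEndian(byte[] x, byte[] y)
{
    for (int i = 0; i < x.Length; i++)
    {
        int diff = x[i].CompareTo(y[i]);
        if (diff != 0) return diff;
    }
    return 0; // identical
}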
You can use SequenceEqual to compare byte arrays; check the following links for details:
How to compare two arrays of bytes
Comparing two byte arrays in .NET
Okay, so I'm trying to make a basic malware scanner in C#. My question is: say I have the hex signature for a particular bit of code, for example:
{
    System.IO.File.Delete(@"C:\Users\Public\DeleteTest\test.txt");
}
// Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c746573742e74787422293b
Gets changed to:
{
    System.IO.File.Delete(@"C:\Users\Public\DeleteTest\notatest.txt");
}
// Which will have a hex of 53797374656d2e494f2e46696c652e44656c657465284022433a5c55736572735c5075626c69635c44656c657465546573745c6e6f7461746573742e74787422293b
Keep in mind these bits will be somewhere within the entire hex of the program. How could I take my base signature and look for partial matches that have, say, a 90% match and therefore get flagged?
I would use a wildcard, but that wouldn't work for slightly more complex things where it might be coded slightly differently but the majority would be the same. So is there a way I can do a percentage match for a substring? I was looking into the Levenshtein distance but I don't see how I'd apply it to this scenario.
Thanks in advance for any input
Using an edit distance would be fine. You can take two strings and calculate the edit distance, which will be an integer value denoting how many operations are needed to take one string to the other. You set your own threshold based off that number.
For example, you may statically set that if the distance is less than five edits, the change is relevant.
You could also take the length of the string you are comparing and take a percentage of that. Your example is 36 characters long, so (int)(input.Length * 0.88m) would be a valid threshold.
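For illustration, a hedged sketch of the classic dynamic-programming Levenshtein distance with such a threshold (the 90% figure is from the question; the arrangement is mine):

static int Levenshtein(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // deletions
    for (int j = 0; j <= b.Length; j++) d[0, j] = j; // insertions
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;  // substitution
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                               d[i - 1, j - 1] + cost);
        }
    return d[a.Length, b.Length];
}

// Flag the candidate if at least ~90% of the signature survives.
static bool Matches(string signature, string candidate) =>
    Levenshtein(signature, candidate) <= (int)(signature.Length * 0.10);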
First, your program bits should match EXACTLY or else it has been modified or is corrupt. Generally, you will store an MD5 hash on the original binary and check the MD5 against new versions to see if they are 'the same enough' (MD5 can't guarantee a 100% match).
Beyond this, in order to detect malware in a random binary, you must know what sort of patterns to look for. For example, if I know a piece of malware injects code with some binary XYZ, I will look for XYZ in the bits of the executable. Patterns get much more complex than that, of course, as the malware bits can be spread out in chunks. What is more interesting is that some viruses are self-morphing. This means that each time it runs, it modifies itself, meaning the scanner does not know an exact pattern to find. In these cases, the scanner must know the types of derivatives that can be produced and look for all of them.
In terms of finding a % match, this operation is very time consuming unless you have constraints. By comparing 2 strings, you cannot tell which pieces were removed, added, or replaced. For instance, if I have a starting string 'ABCD', is 'AABCDD' a 100% match or less since content has been added? What about 'ABCDABCD'; here it matches twice. How about 'AXBXCXD'? What about 'CDAB'?
There are many DIFF tools in existence that can tell you what pieces of a file have been changed (which can lead to a %). Unfortunately, none of them are perfect because of the issues that I described above. You will find that you have false negatives, false positives, etc. This may be 'good enough' for you.
Before you can identify a specific algorithm that will work for you, you will have to decide what the restrictions of your search will be. Otherwise, your scan will be NP-hard, which leads to unreasonable running times (your scanner may run all day just to check one file).
I suggest you look into Levenshtein distance and Damerau-Levenshtein distance.
The former tells you how many insert/delete/substitute operations are needed to turn one string into another; the latter additionally counts transpositions of adjacent characters.
I use these quite a lot when writing programs where users can search for things, but they may not know the exact spelling.
There are code examples on both articles.
An SO post about generating all permutations got me thinking about a few alternative approaches. I was thinking about using space/run-time trade-offs and was wondering if people could critique this approach and possible hiccups while trying to implement it in C#.
The steps go as follows:

1. Given a data structure of homogeneous elements, count the number of elements in the structure.
2. Assuming the permutation consists of all the elements of the structure, calculate the factorial of the value from step 1.
3. Instantiate a new structure (a Dictionary) of type <key (some hash of the collection), Collection<data structure of homogeneous elements>> and initialize a counter.
4. Hash (???) the seed structure from step 1, and insert the key/value pair of hash and collection into the Dictionary. Increment the counter by 1.
5. Randomly shuffle (???) the order of the seed structure, hash it, and then try to insert it into the Dictionary from step 3.
6. If there is a conflict of hashes, repeat step 5 to get a new order and hash and check for a conflict. Upon successful insertion, increment the counter by 1.
7. Repeat steps 5 and 6 until the counter equals the factorial calculated in step 2.
It seems like doing it this way, using some sort of randomizer (which is a black box to me at the moment), might help with getting all the permutations within a decent timeframe for datasets of obscene sizes.
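For reference, the usual choice for that black-box shuffle is a Fisher-Yates shuffle, which produces a uniformly random permutation in O(n); a hedged sketch:

static void Shuffle<T>(IList<T> items, Random rng)
{
    for (int i = items.Count - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);                     // 0 <= j <= i
        (items[i], items[j]) = (items[j], items[i]); // swap
    }
}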
It will be great to get some feedback from the great minds of SO to further analyze this approach whose objective is to deviate from the traditional brute-force approach prevalent in algorithms of such nature and also the repercussions of implementing such an algorithm using C#.
Thanks
This method of generating all permutations does not fare well as compared to the standard known methods.
Say you had n items and M=n! permutations.
This method of generation is expected to produce about M ln M permutations before discovering all M (the coupon-collector bound).
(See this answer for a possible explanation: Programing Pearls - Random Select algorithm)
Also, what would the hash function be? For a reasonable hash function, we might have to start dealing with very large integer issues pretty soon (certainly for any n > 50; I don't remember the exact cut-off point).
This method uses up a lot of memory too (the hashtable of all permutations).
Even assuming the hash is perfect, this method would take expected Omega(nM log M) operations and guaranteed Omega(nM) space, while standard well-known methods can do it in O(M) time and O(n) space.
As a starting point I suggest reading Systematic Generation of All Permutations, which I believe is O(nM) time and O(n) space, and still much better than this method.
Note that if one has to generate all permutations, any algorithm necessarily takes Omega(M) steps, so the method I refer to above is optimal!
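For comparison, one of those standard O(M)-time, O(n)-space generators is Heap's algorithm; a hedged C# sketch (yielding a copy for the caller costs an extra O(n) per permutation, but the generator itself does exactly one swap per permutation):

static IEnumerable<int[]> Permutations(int[] items)
{
    var a = (int[])items.Clone();
    var c = new int[a.Length];         // per-level counters of the iterative form
    yield return (int[])a.Clone();
    for (int i = 0; i < a.Length; )
    {
        if (c[i] < i)
        {
            int j = (i % 2 == 0) ? 0 : c[i];
            (a[i], a[j]) = (a[j], a[i]);   // one swap yields the next permutation
            yield return (int[])a.Clone();
            c[i]++;
            i = 0;
        }
        else
        {
            c[i] = 0;
            i++;
        }
    }
}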
It seems like a complicated way to randomise the order of the generated permutations. In terms of time efficiency, you can't do much better than the 'brute force' approach.
I'm looking for a PRNG (pseudo-random number generator) that you initially seed with an arbitrary array of bytes.
Heard of any?
Hashing your arbitrary-length seed (instead of using XOR as paxdiablo suggested) will ensure that collisions are extremely unlikely, i.e. no more likely than a hash collision; with something such as SHA-1/2 that is a practical impossibility.
You can then use your hashed seed as the input to a decent PRNG such as my favourite, the Mersenne Twister.
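A hedged sketch of that pipeline, using System.Random as a stand-in where a Mersenne Twister implementation would go (SHA256.HashData needs .NET 5+; on older frameworks use SHA256.Create().ComputeHash):

using System;
using System.Security.Cryptography;

static Random CreateSeededRandom(byte[] arbitrarySeed)
{
    // Hash the arbitrary-length seed down to a fixed-size digest...
    byte[] digest = SHA256.HashData(arbitrarySeed);
    // ...then feed as much of it as the PRNG accepts (here, 4 bytes).
    int seed = BitConverter.ToInt32(digest, 0);
    return new Random(seed);   // stand-in: swap in a Mersenne Twister here
}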
UPDATE
The Mersenne Twister implementation available here already seems to accept an arbitrary length key: http://code.msdn.microsoft.com/MersenneTwister/Release/ProjectReleases.aspx?ReleaseId=529
UPDATE 2
For an analysis of just how unlikely a SHA-2 collision is, see how hard someone would have to work to find one, quoting http://en.wikipedia.org/wiki/SHA_hash_functions#SHA-2 :
There are two meet-in-the-middle preimage attacks against SHA-2 with a reduced number of rounds. The first one attacks 41-round SHA-256 out of 64 rounds with time complexity of 2^253.5 and space complexity of 2^16, and 46-round SHA-512 out of 80 rounds with time 2^511.5 and space 2^3. The second one attacks 42-round SHA-256 with time complexity of 2^251.7 and space complexity of 2^12, and 42-round SHA-512 with time 2^502 and space 2^22.
Why don't you just XOR your arbitrary sequence into a type of the right length (padding it with part of itself if necessary)? For example, if you want the seed "paxdiablo" and your PRNG has a four-byte seed:
paxd 0x70617864
iabl 0x6961626c
opax 0x6f706178
----------
0x76707b70 or 0x707b7076 (Intel-endian).
I know that seed looks artificial (and it is, since the key is chosen from alphabetic characters). If you really wanted to spread out keys where the phrase is likely to come from a similar range, XOR the result again with a differentiator like 0xdeadbeef or 0xa55a1248:
paxd 0x70617864 0x70617864
iabl 0x6961626c 0x6961626c
opax 0x6f706178 0x6f706178
0xdeadbeef 0xa55a1248
---------- ----------
0xa8ddc59f 0xd32a6938
I prefer the second one since it will more readily move similar bytes into disparate ranges (the upper bits of the bytes in the differentiator are disparate).
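A hedged C# rendering of that folding (the function name is mine; the modulo indexing implements the "padding it with part of itself" rule):

static int FoldSeed(byte[] seed, uint differentiator = 0xdeadbeef)
{
    uint folded = 0;
    for (int i = 0; i < seed.Length; i += 4)
    {
        uint word = 0;
        for (int j = 0; j < 4; j++)                           // build a big-endian word,
            word = (word << 8) | seed[(i + j) % seed.Length]; // self-padding the tail
        folded ^= word;
    }
    return (int)(folded ^ differentiator);
}

// FoldSeed(Encoding.ASCII.GetBytes("paxdiablo"), 0) gives 0x76707b70, and the
// default 0xdeadbeef differentiator turns that into 0xa8ddc59f, matching the
// worked example above.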