Don't ask how I got there, but I was playing around with masking, loop unrolling, etc. In any case, out of interest I was thinking about how I would implement an IndexOf method, and long story short, all that masking aside, this naive implementation:
public static unsafe int IndexOf16(string s, int startIndex, char c) {
    if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
    fixed (char* cs = s) {
        for (int i = startIndex; i < s.Length; i++) {
            if (cs[i] == c) return i;
        }
        return -1;
    }
}
is faster than string.IndexOf(char). I wrote some simple tests, and it seems to match output exactly.
Some sample output numbers from my machine (it varies to some degree of course, but the trend is clear):
short haystack 500k runs
1741 ms for IndexOf16
2737 ms for IndexOf32
2963 ms for IndexOf64
2337 ms for string.IndexOf <-- built-in
longer haystack:
2888 ms for IndexOf16
3028 ms for IndexOf32
2816 ms for IndexOf64
3353 ms for string.IndexOf <-- built-in
IndexOfChar is marked extern, so you can't inspect it with Reflector. However, I think this should be the (native) implementation:
http://www.koders.com/cpp/fidAB4768BA4DF45482A7A2AA6F39DE9C272B25B8FE.aspx?s=IndexOfChar#L1000
They seem to use the same naive implementation.
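For reference, the linked native code boils down to a plain linear scan; something like this C sketch (the function name is mine, not the CLR's):

```c
#include <stddef.h>

/* Naive linear scan over a UTF-16 buffer, equivalent in spirit to the
   linked native IndexOfChar: return the index of the first match, or -1. */
int index_of_char(const unsigned short *s, size_t len, size_t start, unsigned short c)
{
    for (size_t i = start; i < len; i++) {
        if (s[i] == c)
            return (int)i;
    }
    return -1; /* not found */
}
```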
Questions come to my mind:
1) Am I missing something in my implementation that explains why it's faster? I can only think of extended-character support, but their implementation suggests they don't do anything special for that either.
2) I assumed much of the low-level methods would ultimately be implemented in hand-written assembly; that seems not to be the case. If so, why implement it natively at all, instead of just in C# like my sample implementation?
(Complete test here (I think it's too long to paste here): http://paste2.org/p/1606018 )
(No, this is not premature optimization; it's not for a project, I'm just messing about.) :-)
Update: Thanks to Oliver for the hint about the null check and the count parameter. I have added these to my IndexOf16 implementation like so:
public static unsafe int IndexOf16(string s, int startIndex, char c, int count = -1) {
    if (s == null) throw new ArgumentNullException("s");
    if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
    if (count == -1) count = s.Length - startIndex;
    if (count < 0 || count > s.Length - startIndex) throw new ArgumentOutOfRangeException("count");
    int endIndex = startIndex + count;
    fixed (char* cs = s) {
        for (int i = startIndex; i < endIndex; i++) {
            if (cs[i] == c) return i;
        }
        return -1;
    }
}
The numbers changed slightly, however it is still quite significantly faster (32/64 results omitted):
short haystack 500k runs
1908 ms for IndexOf16
2361 ms for string.IndexOf
longer haystack:
3061 ms for IndexOf16
3391 ms for string.IndexOf
Update2: This version is faster yet (especially for the long haystack case):
public static unsafe int IndexOf16(string s, int startIndex, char c, int count = -1) {
    if (s == null) throw new ArgumentNullException("s");
    if (startIndex < 0 || startIndex >= s.Length) throw new ArgumentOutOfRangeException("startIndex");
    if (count == -1) count = s.Length - startIndex;
    if (count < 0 || count > s.Length - startIndex) throw new ArgumentOutOfRangeException("count");
    int endIndex = startIndex + count;
    fixed (char* cs = s) {
        char* cp = cs + startIndex;
        for (int i = startIndex; i < endIndex; i++, cp++) { // i < endIndex, not <=, or the loop reads one char past the requested range
            if (*cp == c) return i;
        }
        return -1;
    }
}
Update 4:
Based on the discussion with LastCoder I believe this to be architecture dependent. My Xeon W3550 at work seems to prefer this version, while his i7 seems to like the built-in version. My home machine (Athlon II) appears to be in between. I am surprised by the large difference, though.
Possibility 1)
This may not hold (as true) in C#, but when I did optimization work for x86-64 assembly I quickly found out while benchmarking that calling code from a DLL (marked external) was slower than implementing the exact same function within my executable. The most obvious reason is paging and memory: the DLL (external) method is loaded far away in memory from the rest of the running code, and if it wasn't accessed previously it will need to be paged in. Your benchmarking code should do some warm-up loops of the functions you are benchmarking, to make sure they are paged into memory before you time them.
Possibility 2)
Microsoft tends not to optimize string functions to the fullest, so out-optimizing a native string Length, Substring, IndexOf, etc. isn't really unheard of. Anecdote: in x86-64 assembly I was able to create a version of WinXP64's RtlInitUnicodeString function that ran 2x faster in almost all practical use cases.
Possibility 3) Your benchmarking code shows that you're using the 2 parameter overload for IndexOf, this function likely calls the 3 parameter overload IndexOf(Char, Int32, Int32) which adds an extra overhead to each iteration.
This may be even faster because you're removing the i variable increment per iteration.
char* cp = cs + startIndex;
char* cpEnd = cs + endIndex; // note: cs + endIndex, not cp + endIndex, or the end pointer overshoots
while (cp < cpEnd) {
    if (*cp == c) return (int)(cp - cs);
    cp++;
}
return -1;
edit In reply regarding (2) for your curiosity, coded back in 2005 and used to patch the ntdll.dll of my WinXP64 machine. http://board.flatassembler.net/topic.php?t=4467
RtlInitUnicodeString_Opt:   ;; rcx=buff rdx=ucharstr 77bytes
    xor r9d, r9d
    test rdx, rdx
    mov dword [rcx], r9d
    mov [rcx+8], rdx
    jz .end
    mov r8, rdx
.scan:
    mov eax, dword [rdx]
    test ax, ax
    jz .one
    add rdx, 4
    shr eax, 16
    test ax, ax
    jz .two
    jmp .scan
.two:
    add rdx, 2
.one:
    mov eax, 0fffch
    sub rdx, r8
    cmp rdx, 0fffeh
    cmovnb rdx, rax
    mov [ecx], dx
    add dx, 2
    mov [ecx+2], dx
    ret
.end:
    retn
edit 2 Running your example code (updated with your fastest version), string.IndexOf runs faster on my Intel i7, 4GB RAM, Win7 64-bit.
short haystack 500k runs
2590 ms for IndexOf16
2287 ms for string.IndexOf
longer haystack:
3549 ms for IndexOf16
2757 ms for string.IndexOf
Optimizations are sometimes very architecture dependent.
If you really do such a micro measurement, every single bit counts. In the MS implementation (as seen in the link you provided) they also check whether s is null and throw an ArgumentNullException. It is also the implementation that includes the count parameter, so they additionally check whether count has a correct value and throw an ArgumentOutOfRangeException.
I think these little checks, which make the code more robust, are enough to make it a little bit slower when you call it so often in such a short time.
This might have something to do with the fixed statement, as "it pins the location of the src and dst objects in memory so that they will not be moved by garbage collection", perhaps speeding up the methods?
Also, "unsafe code increases the performance by getting rid of array bounds checks" could be part of why.
(The quotes above are taken from MSDN.)
Related
I was browsing crackstation.net website and came across this code which was commented as following:
Compares two byte arrays in length-constant time. This comparison method is used so that password hashes cannot be extracted from on-line systems using a timing attack and then attacked off-line.
private static bool SlowEquals(byte[] a, byte[] b)
{
    uint diff = (uint)a.Length ^ (uint)b.Length;
    for (int i = 0; i < a.Length && i < b.Length; i++)
        diff |= (uint)(a[i] ^ b[i]);
    return diff == 0;
}
Can anyone please explain how this function actually works, why we need to convert the length to an unsigned integer, and how this method avoids a timing attack? What does the line diff |= (uint)(a[i] ^ b[i]); do?
This sets diff based on whether there's a difference between a and b.
It avoids a timing attack by always walking through the entirety of the shorter of the two of a and b, regardless of whether there's a mismatch sooner than that or not.
The diff |= (uint)(a[i] ^ b[i]) takes the exclusive-or of a byte of a with the corresponding byte of b. That will be 0 if the two bytes are the same, or non-zero if they're different. It then ORs that into diff.
Therefore, diff will be set to non-zero in an iteration if a difference was found between the inputs in that iteration. Once diff is given a non-zero value at any iteration of the loop, it will retain the non-zero value through further iterations.
Therefore, the final result in diff will be non-zero if any difference is found between corresponding bytes of a and b, and 0 only if all bytes (and the lengths) of a and b are equal.
Unlike a typical comparison, however, this will always execute the loop until all the bytes in the shorter of the two inputs have been compared to bytes in the other. A typical comparison would have an early-out where the loop would be broken as soon as a mismatch was found:
bool equal(byte a[], byte b[]) {
    if (a.length() != b.length())
        return false;
    for (int i = 0; i < a.length(); i++)
        if (a[i] != b[i])
            return false;
    return true;
}
With this, based on the amount of time consumed to return false, we can learn (at least an approximation of) the number of bytes that matched between a and b. Let's say the initial test of length takes 10 ns, and each iteration of the loop takes another 10 ns. Based on that, if it returns false in 50 ns, we can quickly guess that we have the right length, and the first four bytes of a and b match.
Even without knowing the exact amounts of time, we can still use the timing differences to determine the correct string. We start with a string of length 1, and increase that one byte at a time until we see an increase in the time taken to return false. Then we run through all the possible values in the first byte until we see another increase, indicating that it has executed another iteration of the loop. Continue with the same for successive bytes until all bytes match and we get a return of true.
The original is still open to a little bit of a timing attack -- although we can't easily determine the contents of the correct string based on timing, we can at least find the string length based on timing. Since it only compares up to the shorter of the two strings, we can start with a string of length 1, then 2, then 3, and so on until the time becomes stable. As long as the time is increasing our proposed string is shorter than the correct string. When we give it longer strings, but the time remains constant, we know our string is longer than the correct string. The correct length of string will be the shortest one that takes that maximum duration to test.
Whether this is useful or not depends on the situation, but it's clearly leaking some information, regardless. For truly maximum security, we'd probably want to append random garbage to the end of the real string to make it the length of the user's input, so the time stays proportional to the length of the input, regardless of whether it's shorter, equal to, or longer than the correct string.
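One way to get a running time that depends only on the input's length is to keep comparing past the end of the stored value by wrapping around, rather than stopping at the shorter length. A C sketch of that idea (the helper name and the wrap-around trick are mine, not from the original post):

```c
#include <stddef.h>

/* Compare whose running time depends only on the length of the
   untrusted input `a`, not on where a mismatch occurs or on the
   secret's length: index into `b` modulo its length instead of
   exiting early.  Length differences still force a mismatch via
   the initial XOR of the lengths. */
int slow_equals_padded(const unsigned char *a, size_t a_len,
                       const unsigned char *b, size_t b_len)
{
    if (b_len == 0)
        return a_len == 0;
    unsigned int diff = (unsigned int)a_len ^ (unsigned int)b_len;
    for (size_t i = 0; i < a_len; i++)
        diff |= (unsigned int)(a[i] ^ b[i % b_len]); /* wrap, never break */
    return diff == 0;
}
```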
This version goes on for the length of the input a:
private static bool SlowEquals(byte[] a, byte[] b)
{
    uint diff = (uint)a.Length ^ (uint)b.Length;
    byte[] c = new byte[] { 0 };
    for (int i = 0; i < a.Length; i++)
        diff |= (uint)(GetElem(a, i, c, 0) ^ GetElem(b, i, c, 0));
    return diff == 0;
}

private static byte GetElem(byte[] x, int i, byte[] c, int i0)
{
    bool ok = (i < x.Length);
    return (ok ? x : c)[ok ? i : i0];
}
Doubtless this seems like a strange request, given the availability of ToString() and Convert.ToString(), but I need to convert an unsigned integer (i.e. UInt32) to its string representation, but I need to store the answer into a char[].
The reason is that I am working with character arrays for efficiency, and as the target char[] is initialised as a member to char[10] (to hold the string representation of UInt32.MaxValue) on object creation, it should be theoretically possible to do the conversion without generating any garbage (by which I mean without generating any temporary objects in the managed heap.)
Can anyone see a neat way to achieve this?
(I'm working in Framework 3.5SP1 in case that is any way relevant.)
Further to my comment above, I wondered if log10 was too slow, so I wrote a version that doesn't use it.
For four digit numbers this version is about 35% quicker, falling to about 16% quicker for ten digit numbers.
One disadvantage is that it requires space for the full ten digits in the buffer.
I don't swear it doesn't have any bugs!
public static int ToCharArray2(uint value, char[] buffer, int bufferIndex)
{
    const int maxLength = 10;
    if (value == 0)
    {
        buffer[bufferIndex] = '0';
        return 1;
    }
    int startIndex = bufferIndex + maxLength - 1;
    int index = startIndex;
    do
    {
        buffer[index] = (char)('0' + value % 10);
        value /= 10;
        --index;
    }
    while (value != 0);
    int length = startIndex - index;
    if (bufferIndex != index + 1)
    {
        while (index != startIndex)
        {
            ++index;
            buffer[bufferIndex] = buffer[index];
            ++bufferIndex;
        }
    }
    return length;
}
Update
I should add, I'm using a Pentium 4. More recent processors may calculate transcendental functions faster.
Conclusion
I realised yesterday that I'd made a schoolboy error and run the benchmarks on a debug build. So I ran them again but it didn't actually make much difference. The first column shows the number of digits in the number being converted. The remaining columns show the times in milliseconds to convert 500,000 numbers.
Results for uint:
luc1 arx henk1 luc3 henk2 luc2
1 715 217 966 242 837 244
2 877 420 1056 541 996 447
3 1059 608 1169 835 1040 610
4 1184 795 1282 1116 1162 801
5 1403 969 1405 1396 1279 978
6 1572 1149 1519 1674 1399 1170
7 1740 1335 1648 1952 1518 1352
8 1922 1675 1868 2233 1750 1545
9 2087 1791 2005 2511 1893 1720
10 2263 2103 2139 2797 2012 1985
Results for ulong:
luc1 arx henk1 luc3 henk2 luc2
1 802 280 998 390 856 317
2 912 516 1102 729 954 574
3 1066 746 1243 1060 1056 818
4 1300 1141 1362 1425 1170 1210
5 1557 1363 1503 1742 1306 1436
6 1801 1603 1612 2233 1413 1672
7 2269 1814 1723 2526 1530 1861
8 2208 2142 1920 2886 1634 2149
9 2360 2376 2063 3211 1775 2339
10 2615 2622 2213 3639 2011 2697
11 3048 2996 2513 4199 2244 3011
12 3413 3607 2507 4853 2326 3666
13 3848 3988 2663 5618 2478 4005
14 4298 4525 2748 6302 2558 4637
15 4813 5008 2974 7005 2712 5065
16 5161 5654 3350 7986 2994 5864
17 5997 6155 3241 8329 2999 5968
18 6490 6280 3296 8847 3127 6372
19 6440 6720 3557 9514 3386 6788
20 7045 6616 3790 10135 3703 7268
luc1: Lucero's first function
arx: my function
henk1: Henk's function
luc3: Lucero's third function
henk2: Henk's function without the copy to the char array; i.e. just test the performance of ToString().
luc2: Lucero's second function
The peculiar order is the order they were created in.
I also ran the test without henk1 and henk2 so there would be no garbage collection. The times for the other three functions were nearly identical. Once the benchmark had gone past three digits the memory use was stable: so GC was happening during Henk's functions and didn't have a detrimental effect on the other functions.
Conclusion: just call ToString()
The following code does it, with the following caveat: it does not respect the culture settings, but always outputs normal decimal digits.
public static int ToCharArray(uint value, char[] buffer, int bufferIndex) {
if (value == 0) {
buffer[bufferIndex] = '0';
return 1;
}
int len = (int)Math.Ceiling(Math.Log10(value));
for (int i = len-1; i>= 0; i--) {
buffer[bufferIndex+i] = (char)('0'+(value%10));
value /= 10;
}
return len;
}
The returned value is how much of the char[] has been used.
Edit (for arx): the following version avoids the floating-point math and swaps the buffer in-place:
public static int ToCharArray(uint value, char[] buffer, int bufferIndex) {
    if (value == 0) {
        buffer[bufferIndex] = '0';
        return 1;
    }
    int bufferEndIndex = bufferIndex;
    while (value > 0) {
        buffer[bufferEndIndex++] = (char)('0' + (value % 10));
        value /= 10;
    }
    int len = bufferEndIndex - bufferIndex;
    while (--bufferEndIndex > bufferIndex) {
        char ch = buffer[bufferEndIndex];
        buffer[bufferEndIndex] = buffer[bufferIndex];
        buffer[bufferIndex++] = ch;
    }
    return len;
}
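For comparison, the same write-the-digits-backwards-then-reverse-in-place technique can be sketched in plain C (the function name is mine):

```c
/* Write the decimal digits of value into buf (no NUL terminator),
   least-significant digit first, then reverse the digits in place.
   Returns the number of characters written. */
int to_char_array(unsigned int value, char *buf)
{
    int len = 0;
    do {                                   /* emit digits backwards */
        buf[len++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    for (int i = 0, j = len - 1; i < j; i++, j--) {  /* reverse in place */
        char t = buf[i];
        buf[i] = buf[j];
        buf[j] = t;
    }
    return len;
}
```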
And here yet another variation which computes the number of digits in a small loop:
public static int ToCharArray(uint value, char[] buffer, int bufferIndex) {
    if (value == 0) {
        buffer[bufferIndex] = '0';
        return 1;
    }
    int len = 1;
    for (uint rem = value / 10; rem > 0; rem /= 10) {
        len++;
    }
    for (int i = len - 1; i >= 0; i--) {
        buffer[bufferIndex + i] = (char)('0' + (value % 10));
        value /= 10;
    }
    return len;
}
I leave the benchmarking to whoever wants to do it... ;)
I'm coming a little late to the party, but I guess you probably cannot get faster and less memory-demanding results than by simply reinterpreting the memory:
[System.Security.SecuritySafeCritical]
public static unsafe char[] GetChars(int value, char[] chars)
{
    //TODO: if needed for use across machines then
    // this should also use BitConverter.IsLittleEndian to detect little/big endian
    // and order bytes appropriately
    fixed (char* numPtr = chars)
        *(int*)numPtr = value;
    return chars;
}

[System.Security.SecuritySafeCritical]
public static unsafe int ToInt32(char[] value)
{
    //TODO: if needed for use across machines then
    // this should also use BitConverter.IsLittleEndian to detect little/big endian
    // and order bytes appropriately
    fixed (char* numPtr = value)
        return *(int*)numPtr;
}
This is just a demonstration of the idea - you'd obviously need to add a check for the char array size and make sure that you have the proper byte ordering. You can peek into the reflected helper methods of BitConverter for those checks.
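The same reinterpretation idea sketched in C; here memcpy is used instead of a pointer cast to stay within defined behavior, and the endianness caveat from the TODO comments still applies:

```c
#include <string.h>
#include <stdint.h>

/* Round-trip a 32-bit value through a raw byte buffer without any
   decimal formatting: just reinterpret the bits. */
void put_int32(unsigned char *buf, int32_t value)
{
    memcpy(buf, &value, sizeof value);
}

int32_t get_int32(const unsigned char *buf)
{
    int32_t value;
    memcpy(&value, buf, sizeof value);
    return value;
}
```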
I am trying to write a function to determine whether two equal-size bitmaps are identical or not. The function I have right now simply compares a pixel at a time in each bitmap, returning false at the first non-equal pixel.
While this works, and works well for small bitmaps, in production I'm going to be using this in a tight loop and on larger images, so I need a better way. Does anyone have any recommendations?
The language I'm using is C# by the way - and yes, I am already using the .LockBits method. =)
Edit: I've coded up implementations of some of the suggestions given, and here are the benchmarks. The setup: two identical (worst-case) bitmaps, 100x100 in size, with 10,000 iterations each. Here are the results:
CompareByInts (Marc Gravell) : 1107ms
CompareByMD5 (Skilldrick) : 4222ms
CompareByMask (GrayWizardX) : 949ms
In CompareByInts and CompareByMask I'm using pointers to access the memory directly; in the MD5 method I'm using Marshal.Copy to retrieve a byte array and pass that as an argument to MD5.ComputeHash. CompareByMask is only slightly faster, but given the context I think any improvement is useful.
Thanks everyone. =)
Edit 2: Forgot to turn optimizations on - doing that gives GrayWizardX's answer even more of a boost:
CompareByInts (Marc Gravell) : 944ms
CompareByMD5 (Skilldrick) : 4275ms
CompareByMask (GrayWizardX) : 630ms
CompareByMemCmp (Erik) : 105ms
Interesting that the MD5 method didn't improve at all.
Edit 3: Posted my answer (MemCmp) which blew the other methods out of the water. o.O
Edit 8-31-12: per Joey's comment below, be mindful of the format of the bitmaps you compare. They may contain padding on the strides that render the bitmaps unequal, despite being equivalent pixel-wise. See this question for more details.
Reading this answer to a question regarding comparing byte arrays has yielded a MUCH FASTER method: using P/Invoke and the memcmp API call in msvcrt. Here's the code:
[DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern int memcmp(IntPtr b1, IntPtr b2, UIntPtr count); // memcmp's count is size_t, so marshal it as UIntPtr

public static bool CompareMemCmp(Bitmap b1, Bitmap b2)
{
    if (b1 == null || b2 == null) return b1 == b2; // both null counts as equal
    if (b1.Size != b2.Size) return false;

    var bd1 = b1.LockBits(new Rectangle(new Point(0, 0), b1.Size), ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
    var bd2 = b2.LockBits(new Rectangle(new Point(0, 0), b2.Size), ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);

    try
    {
        IntPtr bd1scan0 = bd1.Scan0;
        IntPtr bd2scan0 = bd2.Scan0;

        int stride = bd1.Stride;
        int len = stride * b1.Height;

        return memcmp(bd1scan0, bd2scan0, (UIntPtr)len) == 0;
    }
    finally
    {
        b1.UnlockBits(bd1);
        b2.UnlockBits(bd2);
    }
}
If you are trying to determine if they are 100% equal, you can invert one and add it to the other; if it's zero they are identical. Extending this with unsafe code, take 64 bits at a time as a long and do the math that way; any difference can cause an immediate fail.
If the images are not 100% identical (comparing png to jpeg), or if you are not looking for a 100% match then you have some more work ahead of you.
Good luck.
Well, you're using .LockBits, so presumably you're using unsafe code. Rather than treating each row origin (Scan0 + y * Stride) as a byte*, consider treating it as an int*; int arithmetic is pretty quick, and you only have to do 1/4 as much work. And for images in ARGB you might still be talking in pixels, making the math simple.
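The suggestion amounts to comparing 32-bit words instead of single bytes; roughly, in C (a sketch assuming the row length is a multiple of 4 bytes, as it is for 32bpp ARGB rows):

```c
#include <stdint.h>
#include <stddef.h>

/* Compare two pixel rows one 32-bit word (i.e. one ARGB pixel) at a
   time instead of byte by byte: a quarter of the loop iterations. */
int rows_equal(const uint32_t *row1, const uint32_t *row2, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; i++) {
        if (row1[i] != row2[i])
            return 0; /* mismatch */
    }
    return 1;
}
```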
Could you take a hash of each and compare? It would be slightly probabilistic, but practically not.
Thanks to Ram, here's a sample implementation of this technique.
If the original problem is just to find the exact duplicates among two bitmaps, then just a bit-level comparison will have to do. I don't know C#, but in C I would use the following function:
int areEqual (long size, long *a, long *b)
{
    long start = size / 2;
    long i;
    for (i = start; i != size; i++) { if (a[i] != b[i]) return 0; }
    for (i = 0; i != start; i++) { if (a[i] != b[i]) return 0; }
    return 1;
}
I would start looking in the middle because I suspect there is a much better chance of finding unequal bits near the middle of the image than at the beginning; of course, this really depends on the images you are deduping, and selecting a random place to start may be best.
If you are trying to find the exact duplicates among hundreds of images, then comparing all pairs of them is unnecessary. First compute the MD5 hash of each image and place it in a list of pairs (md5Hash, imageId); then sort the list by the md5Hash. Next, only do pairwise comparisons on the images that have the same md5Hash.
If these bitmaps are already on your graphics card then you can parallelize such a check by doing it on the graphics card using a language like CUDA or OpenCL.
I'll explain in terms of CUDA, since that's the one I know. Basically CUDA lets you write general purpose code to run in parallel across each node of your graphics card. You can access bitmaps that are in shared memory. Each invocation of the function is also given an index within the set of parallel runs. So, for a problem like this, you'd just run one of the above comparison functions for some subset of the bitmap - using parallelization to cover the entire bitmap. Then, just write a 1 to a certain memory location if the comparison fails (and write nothing if it succeeds).
If you don't already have the bitmaps on your graphics card, this probably isn't the way to go, since the costs for loading the two bitmaps on your card will easily eclipse the savings such parallelization will gain you.
Here's some (pretty bad) example code (it's been a little while since I programmed CUDA). There's better ways to access bitmaps that are already loaded as textures, but I didn't bother here.
// kernel to run on GPU, once per thread
__global__ void compare_bitmaps(long const * const A, long const * const B, char * const retValue, size_t const len)
{
    // divide the work equally among the threads (each thread is in a block, each block is in a grid)
    size_t const threads_per_block = blockDim.x * blockDim.y * blockDim.z;
    size_t const len_to_compare = len / (gridDim.x * gridDim.y * gridDim.z * threads_per_block);

#define offset3(idx3, dim3) (idx3.x + dim3.x * (idx3.y + dim3.y * idx3.z))
    size_t const start_offset = len_to_compare * (offset3(threadIdx, blockDim) + threads_per_block * offset3(blockIdx, gridDim));
    size_t const stop_offset = start_offset + len_to_compare;
#undef offset3

    size_t i;
    for (i = start_offset; i < stop_offset; i++)
    {
        if (A[i] != B[i])
        {
            *retValue = 1;
            break;
        }
    }
    return;
}
If you can implement something like Duff's Device in your language, that might give you a significant speed boost over a simple loop. Usually it's used for copying data, but there's no reason it can't be used for comparing data instead.
Or, for that matter, you may just want to use some equivalent to memcmp().
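For the curious, a Duff's-device-style unrolled comparison might look like this in C (a sketch; whether it actually beats a plain loop or memcmp with a modern optimizing compiler is doubtful):

```c
#include <stddef.h>

/* Duff's-device-style unrolled equality check: the switch jumps into
   the middle of the unrolled do-while body to handle the remainder,
   then each full pass of the loop compares 8 bytes. */
int duff_equal(const unsigned char *a, const unsigned char *b, size_t len)
{
    if (len == 0)
        return 1;
    size_t n = (len + 7) / 8; /* number of (possibly partial) passes */
    switch (len % 8) {
    case 0: do { if (*a++ != *b++) return 0;
    case 7:      if (*a++ != *b++) return 0;
    case 6:      if (*a++ != *b++) return 0;
    case 5:      if (*a++ != *b++) return 0;
    case 4:      if (*a++ != *b++) return 0;
    case 3:      if (*a++ != *b++) return 0;
    case 2:      if (*a++ != *b++) return 0;
    case 1:      if (*a++ != *b++) return 0;
            } while (--n > 0);
    }
    return 1;
}
```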
You could try to add them to a database "blob" and then use the database engine to compare their binaries. This would only give you a yes or no answer as to whether the binary data is the same, though. It would be very easy to make two images that produce the same graphic but have different binary data.
You could also select a few random pixels and compare them, then if they are the same continue with more until you've checked all the pixels. This would only return a faster negative match, though; it would still take just as long to confirm a 100% positive match.
Based on the approach of comparing hashes instead of comparing every single pixel, this is what I use:
public static class Utils
{
    public static byte[] ShaHash(this Image image)
    {
        var bytes = new byte[1];
        bytes = (byte[])(new ImageConverter()).ConvertTo(image, bytes.GetType());
        return (new SHA256Managed()).ComputeHash(bytes);
    }

    public static bool AreEqual(Image imageA, Image imageB)
    {
        if (imageA.Width != imageB.Width) return false;
        if (imageA.Height != imageB.Height) return false;

        var hashA = imageA.ShaHash();
        var hashB = imageB.ShaHash();

        return !hashA
            .Where((nextByte, index) => nextByte != hashB[index])
            .Any();
    }
}
Usage is straightforward:
bool isMatch = Utils.AreEqual(bitmapOne, bitmapTwo);
Can people recommend quick and simple ways to combine the hash codes of two objects? I am not too worried about collisions, since I have a hash table which will handle those efficiently; I just want something that generates a code as quickly as possible.
Reading around SO and the web there seem to be a few main candidates:
XORing
XORing with Prime Multiplication
Simple numeric operations like multiplication/division (with overflow checking or wrapping around)
Building a String and then using the String class's GetHashCode method
What would people recommend and why?
I would personally avoid XOR - it means that any two equal values will result in 0 - so hash(1, 1) == hash(2, 2) == hash(3, 3) etc. Also hash(5, 0) == hash(0, 5) etc which may come up occasionally. I have deliberately used it for set hashing - if you want to hash a sequence of items and you don't care about the ordering, it's nice.
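Those XOR collisions are easy to demonstrate (a trivial C illustration of the point):

```c
/* Combining two hashes with plain XOR: any pair of equal values
   collapses to 0, and swapping the operands changes nothing. */
unsigned int xor_combine(unsigned int h1, unsigned int h2)
{
    return h1 ^ h2;
}
```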
I usually use:
unchecked
{
    int hash = 17;
    hash = hash * 31 + firstField.GetHashCode();
    hash = hash * 31 + secondField.GetHashCode();
    return hash;
}
That's the form that Josh Bloch suggests in Effective Java. Last time I answered a similar question I managed to find an article where this was discussed in detail - IIRC, no-one really knows why it works well, but it does. It's also easy to remember, easy to implement, and easy to extend to any number of fields.
If you are using .NET Core 2.1 or later or .NET Framework 4.6.1 or later, consider using the System.HashCode struct to help with producing composite hash codes. It has two modes of operation: Add and Combine.
An example using Combine, which is usually simpler and works for up to eight items:
public override int GetHashCode()
{
    return HashCode.Combine(object1, object2);
}
An example of using Add:
public override int GetHashCode()
{
    var hash = new HashCode();
    hash.Add(this.object1);
    hash.Add(this.object2);
    return hash.ToHashCode();
}
Pros:
Part of .NET itself, as of .NET Core 2.1/.NET Standard 2.1 (though, see con below)
For .NET Framework 4.6.1 and later, the Microsoft.Bcl.HashCode NuGet package can be used to backport this type.
Looks to have good performance and mixing characteristics, based on the work the author and the reviewers did before merging this into the corefx repo
Handles nulls automatically
Overloads that take IEqualityComparer instances
Cons:
Not available on .NET Framework before .NET 4.6.1. HashCode is part of .NET Standard 2.1. As of September 2019, the .NET team has no plans to support .NET Standard 2.1 on the .NET Framework, as .NET Core/.NET 5 is the future of .NET.
General purpose, so it won't handle super-specific cases as well as hand-crafted code
While the template outlined in Jon Skeet's answer works well in general as a hash function family, the choice of the constants is important and the seed of 17 and factor of 31 as noted in the answer do not work well at all for common use cases. In most use cases, the hashed values are much closer to zero than int.MaxValue, and the number of items being jointly hashed are a few dozen or less.
For hashing an integer tuple {x, y} where -1000 <= x <= 1000 and -1000 <= y <= 1000, it has an abysmal collision rate of almost 98.5%. For example, {1, 0} -> {0, 31}, {1, 1} -> {0, 32}, etc. If we expand the coverage to also include n-tuples where 3 <= n <= 25, it does less terrible with a collision rate of about 38%. But we can do much better.
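The {1, 0} vs {0, 31} collision can be checked directly; here is a C sketch of the same seed/factor polynomial hash for a pair of values:

```c
/* Bloch-style polynomial hash of two values:
   h = seed; h = h * factor + v1; h = h * factor + v2. */
unsigned int poly_hash2(unsigned int seed, unsigned int factor,
                        unsigned int v1, unsigned int v2)
{
    unsigned int h = seed;
    h = h * factor + v1;
    h = h * factor + v2;
    return h;
}
```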
public static int CustomHash(int seed, int factor, params int[] vals)
{
    int hash = seed;
    foreach (int i in vals)
    {
        hash = (hash * factor) + i;
    }
    return hash;
}
I wrote a Monte Carlo sampling search loop that tested the method above with various values for seed and factor over various random n-tuples of random integers i. Allowed ranges were 2 <= n <= 25 (where n was random but biased toward the lower end of the range) and -1000 <= i <= 1000. At least 12 million unique collision tests were performed for each seed and factor pair.
After about 7 hours running, the best pair found (where the seed and factor were both limited to 4 digits or less) was: seed = 1009, factor = 9176, with a collision rate of 0.1131%. In the 5- and 6-digit areas, even better options exist. But I selected the top 4-digit performer for brevity, and it performs quite well in all common int and char hashing scenarios. It also seems to work fine with integers of much greater magnitudes.
It is worth noting that "being prime" did not seem to be a general prerequisite for good performance as a seed and/or factor although it likely helps. 1009 noted above is in fact prime, but 9176 is not. I explicitly tested variations on this where I changed factor to various primes near 9176 (while leaving seed = 1009) and they all performed worse than the above solution.
Lastly, I also compared against the generic ReSharper recommendation function family of hash = (hash * factor) ^ i; and the original CustomHash() as noted above seriously outperforms it. The ReSharper XOR style seems to have collision rates in the 20-30% range for common use case assumptions and should not be used in my opinion.
Use the combination logic of tuples. The example uses C# 7 tuples.
(field1, field2).GetHashCode();
I presume that the .NET Framework team did a decent job of testing their System.String.GetHashCode() implementation, so I would use it:
// System.String.GetHashCode(): http://referencesource.microsoft.com/#mscorlib/system/string.cs,0a17bbac4851d0d4
// System.Web.Util.StringUtil.GetStringHashCode(System.String): http://referencesource.microsoft.com/#System.Web/Util/StringUtil.cs,c97063570b4e791a
public static int CombineHashCodes(IEnumerable<int> hashCodes)
{
    int hash1 = (5381 << 16) + 5381;
    int hash2 = hash1;

    int i = 0;
    foreach (var hashCode in hashCodes)
    {
        if (i % 2 == 0)
            hash1 = ((hash1 << 5) + hash1 + (hash1 >> 27)) ^ hashCode;
        else
            hash2 = ((hash2 << 5) + hash2 + (hash2 >> 27)) ^ hashCode;
        ++i;
    }

    return hash1 + (hash2 * 1566083941);
}
Another implementation is from System.Web.Util.HashCodeCombiner.CombineHashCodes(System.Int32, System.Int32) and System.Array.CombineHashCodes(System.Int32, System.Int32) methods. This one is simpler, but probably doesn't have such a good distribution as the method above:
// System.Web.Util.HashCodeCombiner.CombineHashCodes(System.Int32, System.Int32): http://referencesource.microsoft.com/#System.Web/Util/HashCodeCombiner.cs,21fb74ad8bb43f6b
// System.Array.CombineHashCodes(System.Int32, System.Int32): http://referencesource.microsoft.com/#mscorlib/system/array.cs,87d117c8cc772cca
public static int CombineHashCodes(IEnumerable<int> hashCodes)
{
    int hash = 5381;
    foreach (var hashCode in hashCodes)
        hash = ((hash << 5) + hash) ^ hashCode;
    return hash;
}
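For illustration, here is the same DJB2-style fold as a C sketch (my translation, not from the reference source). Note that the fold is order-sensitive, which is usually what you want when hashing fields; unsigned arithmetic is used to keep the overflow well-defined in C:

```c
#include <stddef.h>

/* C sketch of the simple combiner above: hash = hash * 33 ^ next,
 * starting from the DJB2 seed 5381. Order matters: combining {a, b}
 * and {b, a} produces different hashes. */
static unsigned combine_hash_codes(const unsigned *codes, size_t n)
{
    unsigned hash = 5381u;
    for (size_t i = 0; i < n; i++)
        hash = ((hash << 5) + hash) ^ codes[i]; /* hash * 33 ^ code */
    return hash;
}
```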
This is a repackaging of Special Sauce's brilliantly researched solution.
It makes use of value tuples (via the ITuple interface), which allows defaults for the seed and factor parameters.
public static int CombineHashes(this ITuple tupled, int seed = 1009, int factor = 9176)
{
    var hash = seed;
    for (var i = 0; i < tupled.Length; i++)
    {
        unchecked
        {
            // guard against null elements in the tuple
            hash = hash * factor + (tupled[i]?.GetHashCode() ?? 0);
        }
    }
    return hash;
}
Usage:
var hash1 = ("Foo", "Bar", 42).CombineHashes();
var hash2 = ("Jon", "Skeet", "Constants").CombineHashes(seed: 17, factor: 31);
If your input hashes are the same size, evenly distributed and not related to each other then an XOR should be OK. Plus it's fast.
The situation I'm suggesting this for is where you want to do
H = hash(A) ^ hash(B); // A and B are different types, so there's no way A == B.
Of course, if A and B can be expected to hash to the same value with a reasonable (non-negligible) probability, then you should not use XOR in this way.
If you're looking for speed and don't have too many collisions, then XOR is fastest. To prevent a clustering around zero, you could do something like this:
finalHash = hash1 ^ hash2;
return finalHash != 0 ? finalHash : hash1;
Of course, some prototyping ought to give you an idea of performance and clustering.
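As a small C sketch (my illustration, with hypothetical names) of the zero-clustering guard described above:

```c
/* Sketch of the XOR combiner with the zero-clustering guard: when the
 * two hashes are equal, plain XOR collapses to 0, so fall back to the
 * first hash instead. */
static unsigned combine_xor(unsigned hash1, unsigned hash2)
{
    unsigned final_hash = hash1 ^ hash2;
    return final_hash != 0 ? final_hash : hash1;
}
```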
Assuming you have a relevant ToString() override (where your different fields appear), I would just return its hash code:
this.ToString().GetHashCode();
This is not very fast, but it should avoid collisions quite well.
I would recommend using the built-in hash functions in System.Security.Cryptography rather than rolling your own.
I am looking to refactor a C# method into a C function in an attempt to gain some speed, and then call the C DLL from C# to allow my program to use that functionality.
Currently the C# method takes a list of integers and returns a list of lists of integers. The method calculates the power set of the integers, so an input of 3 ints would produce the following output (at this stage the values of the ints are not important, as they are used as internal weighting values):
1
2
3
1,2
1,3
2,3
1,2,3
Where each line represents a list of integers. Each number is the index (offset by 1) into the input list, not the value. So 1,2 indicates that the elements at indexes 0 and 1 are members of that subset of the power set.
I am unfamiliar with C, so what are my best options for data structures that will allow the C# side to access the returned data?
Thanks in advance
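One marshalling-friendly option for the data-structure question (my suggestion, not from the question): since the output only needs element indexes, each subset of a set with up to 32 members fits in a single bitmask, and a flat array of masks is trivial for C# to read. Conveniently, the loop counter of the standard power-set enumeration is itself the bitmask:

```c
#include <stdint.h>

/* Sketch: represent each subset as a uint32_t bitmask -- bit j set
 * means the element at index j is in the subset. A flat array of
 * these marshals to C# as a plain uint[]. The loop counter s already
 * encodes subset s, so filling the array is trivial. */
static uint32_t fill_subset_masks(int count, uint32_t *masks)
{
    uint32_t total = (uint32_t)1 << count;   /* 2^count subsets */
    for (uint32_t s = 0; s < total; s++)
        masks[s] = s;                        /* subset s == bitmask s */
    return total;
}
```

The C# side can then decode each mask with the same bit tests the original inner loop uses, without any pointer-chasing across the interop boundary.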
Update
Thank you all for your comments so far. Here is a bit of a background to the nature of the problem.
The iterative method for calculating the power set of a set is fairly straightforward. Two loops and a bit of bit manipulation is all there is to it, really. It just gets called... a lot (in fact, billions of times if the size of the set is big enough).
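The "two loops and a bit of bit manipulation" method can be sketched in C like this (the flat buffer layout is my assumption, chosen because it is easy to hand back across an interop boundary):

```c
#include <stddef.h>

/* Sketch (not the poster's code): enumerate the power set of `count`
 * items with two loops and bit tests. Each subset is written as a run
 * of member indexes into `out`; sizes[s] records how many indexes
 * subset s contains. The caller provides buffers big enough:
 * out needs count * 2^(count-1) slots, sizes needs 2^count. */
static size_t power_set_indices(int count, int *out, int *sizes)
{
    size_t total = (size_t)1 << count;
    size_t pos = 0;
    for (size_t s = 0; s < total; s++) {    /* outer loop: one subset per pass */
        int k = 0;
        for (int j = 0; j < count; j++)     /* inner loop: test each bit */
            if (s & ((size_t)1 << j))
                out[pos + k++] = j;         /* record member index */
        sizes[s] = k;
        pos += (size_t)k;
    }
    return total;
}
```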
My thoughts around using C (C++, as people have pointed out) are that it gives more scope for performance tuning. A direct port may not offer any increase, but it opens the way for more involved methods to get a bit more speed out of it. Even a small increase per iteration would equate to a measurable improvement.
My idea was to port a direct version and then work to increase it. And then refactor it over time (with help from everyone here at SO).
Update 2
Another fair point from jalf: I don't have to use List or an equivalent. If there is a better way, then I am open to suggestions. The only reason for the list was that each set of results is not the same size.
The code so far...
public List<List<int>> powerset(List<int> currentGroupList)
{
    //Count the objects in the group
    int count = currentGroupList.Count;
    int max = (int)Math.Pow(2, count);
    var outputList = new List<List<int>>();
    //outer loop
    for (int i = 0; i < max; i++)
    {
        var currentSet = new List<int>();
        //inner loop
        for (int j = 0; j < count; j++)
        {
            if ((i & (1 << j)) == 0)
            {
                currentSet.Add(currentGroupList[j]);
            }
        }
        outputList.Add(currentSet);
    }
    return outputList;
}
As you can see, not a lot to it. It just goes round and round a lot!
I accept that the creating and building of lists may not be the most efficient way, but I need some way of providing the results back in a manageable way.
Update 3
Thanks for all the input and implementation work. Just to clarify a couple of points raised: I don't need the output to be in 'natural order', and I am also not that interested in the empty set being returned.
hughdbrown's implementation is interesting, but I think that I will need to store the results (or at least a subset of them) at some point. It sounds like memory limitations will apply long before running time becomes a real issue.
Partly because of this, I think I can get away with using bytes instead of integers, giving more potential storage.
The question really is, then: have we reached the maximum speed for this calculation in C#? Does the option of unmanaged code provide any more scope? I know in many respects the answer is futile, as even if we halved the running time, it would only allow one extra value in the original set.
Also, be sure that moving to C/C++ is really what you need to do for speed to begin with. Instrument the original C# method (standalone, executed via unit tests), instrument the new C/C++ method (again, standalone via unit tests) and see what the real world difference is.
The reason I bring this up is that I fear it may be a Pyrrhic victory -- using Smokey Bacon's advice, you get your list class, you're in "faster" C++, but there's still a cost to calling that DLL: bouncing out of the runtime with P/Invoke or COM interop carries a fairly substantial performance cost.
Be sure you're getting your "money's worth" out of that jump before you do it.
Update based on the OP's Update
If you're calling this loop repeatedly, you need to absolutely make sure that the entire loop logic is encapsulated in a single interop call -- otherwise the overhead of marshalling (as others here have mentioned) will definitely kill you.
I do think, given the description of the problem, that the issue isn't that C#/.NET is "slower" than C, but more likely that the code needs to be optimized. As another poster here mentioned, you can use pointers in C# to seriously boost performance in this kind of loop, without the need for marshalling. I'd look into that first, before jumping into a complex interop world, for this scenario.
If you are looking to use C for a performance gain, most likely you are planning to do so through the use of pointers. C# does allow for use of pointers, using the unsafe keyword. Have you considered that?
Also, how will you be calling this code? Will it be called often (e.g. in a loop)? If so, marshalling the data back and forth may more than offset any performance gains.
Follow Up
Take a look at Native code without sacrificing .NET performance for some interop options. There are ways to interop without too much of a performance loss, but those interops can only happen with the simplest of data types.
Though I still think that you should investigate speeding up your code using straight .NET.
Follow Up 2
Also, may I suggest that if you have your heart set on mixing native code and managed code, you create your library using C++/CLI. Below is a simple example. Note that I am not a C++/CLI guy, and this code doesn't do anything useful... it's just meant to show how easily you can mix native and managed code.
#include "stdafx.h"

using namespace System;

System::Collections::Generic::List<int> ^MyAlgorithm(System::Collections::Generic::List<int> ^sourceList);

int main(array<System::String ^> ^args)
{
    System::Collections::Generic::List<int> ^intList = gcnew System::Collections::Generic::List<int>();
    intList->Add(1);
    intList->Add(2);
    intList->Add(3);
    intList->Add(4);
    intList->Add(5);

    Console::WriteLine("Before Call");
    for each(int i in intList)
    {
        Console::WriteLine(i);
    }

    System::Collections::Generic::List<int> ^modifiedList = MyAlgorithm(intList);

    Console::WriteLine("After Call");
    for each(int i in modifiedList)
    {
        Console::WriteLine(i);
    }
}

System::Collections::Generic::List<int> ^MyAlgorithm(System::Collections::Generic::List<int> ^sourceList)
{
    int nativeIntArraySize = sourceList->Count;
    int* nativeInts = new int[nativeIntArraySize];

    //Managed to Native
    for(int i = 0; i < nativeIntArraySize; i++)
    {
        nativeInts[i] = sourceList[i];
    }

    //Do Something to native ints
    for(int i = 0; i < nativeIntArraySize; i++)
    {
        nativeInts[i]++;
    }

    //Native to Managed
    System::Collections::Generic::List<int> ^returnList = gcnew System::Collections::Generic::List<int>();
    for(int i = 0; i < nativeIntArraySize; i++)
    {
        returnList->Add(nativeInts[i]);
    }

    delete[] nativeInts; //don't leak the native buffer
    return returnList;
}
What makes you think you'll gain speed by calling into C code? C isn't magically faster than C#. It can be, of course, but it can also easily be slower (and buggier). Especially when you factor in the p/invoke calls into native code, it's far from certain that this approach will speed up anything.
In any case, C doesn't have anything like List<T>. It has raw arrays and pointers (and you could argue that int** is more or less equivalent), but you're probably better off using C++, which does have equivalent data structures. In particular, std::vector.
There are no simple ways to expose this data to C# however, since it will be scattered pretty much randomly (each list is a pointer to some dynamically allocated memory somewhere)
However, I suspect the biggest performance improvement comes from improving the algorithm in C#.
Edit:
I can see several things in your algorithm that seem suboptimal. Constructing a list of lists isn't free. Perhaps you can create a single list and use different offsets to represent each sublist. Or perhaps using 'yield return' and IEnumerable instead of explicitly constructing lists might be faster.
Have you profiled your code, found out where the time is being spent?
This returns one set of a powerset at a time. It is based on python code here. It works for powersets of over 32 elements. If you need fewer than 32, you can change long to int. It is pretty fast -- faster than my previous algorithm and faster than (my modified to use yield return version of) P Daddy's code.
static class PowerSet4<T>
{
    static public IEnumerable<IList<T>> powerset(T[] currentGroupList)
    {
        int count = currentGroupList.Length;
        Dictionary<long, T> powerToIndex = new Dictionary<long, T>();
        long mask = 1L;
        for (int i = 0; i < count; i++)
        {
            powerToIndex[mask] = currentGroupList[i];
            mask <<= 1;
        }

        Dictionary<long, T> result = new Dictionary<long, T>();
        yield return result.Values.ToArray();

        long max = 1L << count;
        for (long i = 1L; i < max; i++)
        {
            long key = i & -i;
            if (result.ContainsKey(key))
                result.Remove(key);
            else
                result[key] = powerToIndex[key];
            yield return result.Values.ToArray();
        }
    }
}
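The key = i & -i step above is the binary-reflected Gray code trick: incrementing i flips the membership of exactly one element, the one keyed by the lowest set bit of i, and that walk visits every subset exactly once. A small C check (my sketch, not part of the answer) of that property:

```c
/* Sketch: toggling the lowest set bit of i in a running membership
 * mask walks a Gray-code sequence, so consecutive subsets differ by
 * exactly one element and every subset is visited once. (This checker
 * works for count <= 4, since "seen" tracks 2^count subsets in one
 * unsigned.) */
static int visits_every_subset(int count)
{
    unsigned total = 1u << count;
    unsigned seen = 0;          /* bit s set once subset-mask s was produced */
    unsigned cur = 0;           /* current subset as a bitmask */
    seen |= 1u << cur;          /* the empty set comes first */
    for (unsigned i = 1; i < total; i++) {
        cur ^= i & (0u - i);    /* toggle the member keyed by i's lowest set bit */
        seen |= 1u << cur;
    }
    return seen == (1u << total) - 1;
}
```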
You can download all the fastest versions I have tested here.
I really think that using yield return is the change that makes calculating large powersets possible. Allocating large amounts of memory upfront increases runtime dramatically and causes algorithms to fail for lack of memory very early on. Original Poster should figure out how many sets of a powerset he needs at once. Holding all of them is not really an option with >24 elements.
I'm also going to put in a vote for tuning-up your C#, particularly by going to 'unsafe' code and losing what might be a lot of bounds-checking overhead.
Even though it's 'unsafe', it's no less 'safe' than C/C++, and it's dramatically easier to get right.
Below is a C# algorithm that should be much faster (and use less memory) than the algorithm you posted. It doesn't use the neat binary trick yours uses, and as a result, the code is a good bit longer. It has a few more for loops than yours, and might take a time or two stepping through it with the debugger to fully grok it. But it's actually a simpler approach, once you understand what it's doing.
As a bonus, the returned sets are in a more "natural" order. It would return subsets of the set {1 2 3} in the same order you listed them in your question. That wasn't a focus, but is a side effect of the algorithm used.
In my tests, I found this algorithm to be approximately 4 times faster than the algorithm you posted for a large set of 22 items (which was as large as I could go on my machine without excessive disk-thrashing skewing the results too much). One run of yours took about 15.5 seconds, and mine took about 3.6 seconds.
For smaller lists, the difference is less pronounced. For a set of only 10 items, yours ran 10,000 times in about 7.8 seconds, and mine took about 3.2 seconds. For sets with 5 or fewer items, they run close to the same time. With many iterations, yours runs a little faster.
Anyway, here's the code. Sorry it's so long; I tried to make sure I commented it well.
/*
 * Made it static, because it shouldn't really use or modify state data.
 * Making it static also saves a tiny bit of call time, because it doesn't
 * have to receive an extra "this" pointer. Also, accessing a local
 * parameter is a tiny bit faster than accessing a class member, because
 * dereferencing the "this" pointer is not free.
 *
 * Made it generic so that the same code can handle sets of any type.
 */
static IList<IList<T>> PowerSet<T>(IList<T> set){
    if(set == null)
        throw new ArgumentNullException("set");

    /*
     * Caveat:
     * If set.Count > 30, this function pukes all over itself without so
     * much as wiping up afterwards. Even for 30 elements, though, the
     * result set is about 68 GB (if "set" is comprised of ints). 24 or
     * 25 elements is a practical limit for current hardware.
     */
    int setSize = set.Count;
    int subsetCount = 1 << setSize; // MUCH faster than (int)Math.Pow(2, setSize)
    T[][] rtn = new T[subsetCount][];

    /*
     * We don't really need dynamic list allocation. We can calculate
     * in advance the number of subsets ("subsetCount" above), and
     * the size of each subset (0 through setSize). The performance
     * of List<> is pretty horrible when the initial size is not
     * guessed well.
     */
    int subsetIndex = 0;
    for(int subsetSize = 0; subsetSize <= setSize; subsetSize++){
        /*
         * The "indices" array below is part of how we implement the
         * "natural" ordering of the subsets. For a subset of size 3,
         * for example, we initialize the indices array with {0, 1, 2};
         * Later, we'll increment each index until we reach setSize,
         * then carry over to the next index. So, assuming a set size
         * of 5, the second iteration will have indices {0, 1, 3}, the
         * third will have {0, 1, 4}, and the fifth will involve a carry,
         * so we'll have {0, 2, 3}.
         */
        int[] indices = new int[subsetSize];
        for(int i = 1; i < subsetSize; i++)
            indices[i] = i;

        /*
         * Now we'll iterate over all the subsets we need to make for the
         * current subset size. The number of subsets of a given size
         * is easily determined with combination (nCr). In other words,
         * if I have 5 items in my set and I want all subsets of size 3,
         * I need 5-pick-3, or 5C3 = 5! / 3!(5 - 3)! = 10.
         */
        for(int i = Combination(setSize, subsetSize); i > 0; i--){
            /*
             * Copy the items from the input set according to the
             * indices we've already set up. Alternatively, if you
             * just wanted the indices in your output, you could
             * just dup the index array here (but make sure you dup!
             * Otherwise the setup step at the bottom of this for
             * loop will mess up your output list! You'll also want
             * to change the function's return type to
             * IList<IList<int>> in that case.
             */
            T[] subset = new T[subsetSize];
            for(int j = 0; j < subsetSize; j++)
                subset[j] = set[indices[j]];

            /* Add the subset to the return */
            rtn[subsetIndex++] = subset;

            /*
             * Set up indices for next subset. This looks a lot
             * messier than it is. It simply increments the
             * right-most index until it overflows, then carries
             * over left as far as it needs to. I've made the
             * logic as fast as I could, which is why it's hairy-
             * looking. Note that the inner for loop won't
             * actually run as long as a carry isn't required,
             * and will run at most once in any case. The outer
             * loop will go through as few iterations as required.
             *
             * You may notice that this logic doesn't check the
             * end case (when the left-most digit overflows). It
             * doesn't need to, since the loop up above won't
             * execute again in that case, anyway. There's no
             * reason to waste time checking that here.
             */
            for(int j = subsetSize - 1; j >= 0; j--)
                if(++indices[j] <= setSize - subsetSize + j){
                    for(int k = j + 1; k < subsetSize; k++)
                        indices[k] = indices[k - 1] + 1;
                    break;
                }
        }
    }
    return rtn;
}
static int Combination(int n, int r){
    if(r == 0 || r == n)
        return 1;

    /*
     * The formula for combination is:
     *
     *       n!
     *   ----------
     *   r!(n - r)!
     *
     * We'll actually use a slightly modified version here. The above
     * formula forces us to calculate (n - r)! twice. Instead, we only
     * multiply for the numerator the factors of n! that aren't canceled
     * out by (n - r)! in the denominator.
     */

    /*
     * nCr == nC(n - r)
     * We can use this fact to reduce the number of multiplications we
     * perform, as well as the incidence of overflow, where r > n / 2
     */
    if(r > n / 2) /* We DO want integer truncation here (7 / 2 = 3) */
        r = n - r;

    /*
     * I originally used all integer math below, with some complicated
     * logic and another function to handle cases where the intermediate
     * results overflowed a 32-bit int. It was pretty ugly. In later
     * testing, I found that the more generalized double-precision
     * floating-point approach was actually *faster*, so there was no
     * need for the ugly code. But if you want to see a giant WTF, look
     * at the edit history for this post!
     */
    double denominator = Factorial(r);
    double numerator = n;
    while(--r > 0)
        numerator *= --n;

    return (int)(numerator / denominator + 0.1/* Deal with rounding errors. */);
}
/*
 * The archetypical factorial implementation is recursive, and is perhaps
 * the most often used demonstration of recursion in text books and other
 * materials. It's unfortunate, however, that few texts point out that
 * it's nearly as simple to write an iterative factorial function that
 * will perform better (although tail-end recursion, if implemented by
 * the compiler, will help to close the gap).
 */
static double Factorial(int x){
    /*
     * An all-purpose factorial function would handle negative numbers
     * correctly - the result should be Sign(x) * Factorial(Abs(x)) -
     * but since we don't need that functionality, we're better off
     * saving the few extra clock cycles it would take.
     */

    /*
     * I originally used all integer math below, but found that the
     * double-precision floating-point version is not only more
     * general, but also *faster*!
     */
    if(x < 2)
        return 1;

    double rtn = x;
    while(--x > 1)
        rtn *= x;

    return rtn;
}
Your list of results does not match the results your code would produce. In particular, you do not show generating the empty set.
If I were producing powersets that could have a few billion subsets, then generating each subset separately rather than all at once might cut down on your memory requirements, improving your code's speed. How about this:
static class PowerSet<T>
{
    static long[] mask = { 1L << 0, 1L << 1, 1L << 2, 1L << 3,
                           1L << 4, 1L << 5, 1L << 6, 1L << 7,
                           1L << 8, 1L << 9, 1L << 10, 1L << 11,
                           1L << 12, 1L << 13, 1L << 14, 1L << 15,
                           1L << 16, 1L << 17, 1L << 18, 1L << 19,
                           1L << 20, 1L << 21, 1L << 22, 1L << 23,
                           1L << 24, 1L << 25, 1L << 26, 1L << 27,
                           1L << 28, 1L << 29, 1L << 30, 1L << 31};

    static public IEnumerable<IList<T>> powerset(T[] currentGroupList)
    {
        int count = currentGroupList.Length;
        long max = 1L << count;
        for (long iter = 0; iter < max; iter++)
        {
            // size the output array to the number of set bits in iter,
            // so the yielded subset has no trailing default elements
            int size = 0;
            for (long i = iter; i != 0; i &= (i - 1))
                size++;

            T[] list = new T[size];
            int k = 0, m = -1;
            for (long i = iter; i != 0; i &= (i - 1))
            {
                while ((mask[++m] & i) == 0)
                    ;
                list[k++] = currentGroupList[m];
            }
            yield return list;
        }
    }
}
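The inner loop's bit tricks can be shown in isolation (a C sketch, my illustration): i &= (i - 1) clears the lowest set bit each pass, and i & -i isolates it, so the loop touches only the members actually present in the subset:

```c
/* Sketch: collect the positions of the set bits of iter, the same way
 * the inner loop above walks subset members. i &= (i - 1) clears the
 * lowest set bit each pass; i & -i isolates it. */
static int collect_set_bits(unsigned long iter, int *out)
{
    int k = 0;
    for (unsigned long i = iter; i != 0; i &= (i - 1)) {
        unsigned long low = i & (0ul - i);      /* lowest set bit */
        int m = 0;
        while (low > 1) { low >>= 1; m++; }     /* its position */
        out[k++] = m;
    }
    return k;
}
```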
Then your client code looks like this:
static void Main(string[] args)
{
    int[] intList = { 1, 2, 3, 4 };
    foreach (IList<int> set in PowerSet<int>.powerset(intList))
    {
        foreach (int i in set)
            Console.Write("{0} ", i);
        Console.WriteLine();
    }
}
I'll even throw in a bit-twiddling algorithm with templated arguments for free. For added speed, you can wrap the powerlist() inner loop in an unsafe block. It doesn't make much difference.
On my machine, this code is slightly slower than the OP's code until the sets are 16 or larger. However, all times to 16 elements are less than 0.15 seconds. At 23 elements, it runs in 64% of the time. The original algorithm does not run on my machine for 24 or more elements -- it runs out of memory.
This code takes 12 seconds to generate the power set for the numbers 1 to 24, omitting screen I/O time. That's 16 million-ish in 12 seconds, or about 1400K per second. For a billion (which is what you quoted earlier), that would be about 760 seconds. How long do you think this should take?
Does it have to be C, or is C++ an option too? If C++, you can just use the list types from the STL. Otherwise, you'll have to implement your own list: look up linked lists or dynamically sized arrays for pointers on how to do this.
I concur with the "optimize .NET first" opinion. It's the most painless. I imagine that if you wrote some unmanaged .NET code using C# pointers, it'd be identical to C execution, except for the VM overhead.
P Daddy:
You could change your Combination() code to this:
static long Combination(long n, long r)
{
    r = (r > n - r) ? (n - r) : r;
    if (r == 0)
        return 1;

    long result = 1;
    long k = 1;
    while (r-- > 0)
    {
        result *= n--;
        result /= k++;
    }

    return result;
}
This will reduce the multiplications and the chance of overflow to a minimum.
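The incremental division here stays exact at every step: after multiplying k consecutive factors, the running product is a binomial coefficient times an integer, and the product of any k consecutive integers is divisible by k!. A quick C check of the same routine (my sketch; long long is used since intermediate values can exceed 32 bits):

```c
/* C sketch of the Combination() routine above; result /= k is exact
 * at each step because the running product of k consecutive integers
 * is always divisible by k!. */
static long long combination(long long n, long long r)
{
    r = (r > n - r) ? (n - r) : r;      /* nCr == nC(n-r); use the smaller r */
    if (r == 0)
        return 1;

    long long result = 1, k = 1;
    while (r-- > 0) {
        result *= n--;
        result /= k++;                  /* exact: no remainder is ever lost */
    }
    return result;
}
```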