Memory alignment of classes in C#?

(btw. This refers to 32 bit OS)
SOME UPDATES:
This is definitely an alignment issue
Sometimes the alignment (for whatever reason?) is so bad that access to the double is more than 50x slower than its fastest access.
Running the code on a 64-bit machine cuts down the issue, but I think it was still alternating between two timings (I could get similar results by changing the double to a float on a 32-bit machine)
Running the code under mono exhibits no issue -- Microsoft, any chance you can copy something from those Novell guys???
Is there a way to memory align the allocation of classes in c#?
The following demonstrates (I think!) the badness of not having doubles aligned correctly. It does some simple math on a double stored in a class, timing each run, running 5 timed runs on the variable before allocating a new one and doing it over again.
Basically the results look like you either have a fast, medium or slow memory position (on my ancient processor, these end up around 40, 80 or 120ms per run)
I have tried playing with StructLayoutAttribute, but have had no joy - maybe something else is going on?
using System;
using System.Diagnostics;

class Sample
{
    class Variable { public double Value; }

    static void Main()
    {
        const int COUNT = 10000000;
        while (true)
        {
            var x = new Variable();
            for (int inner = 0; inner < 5; ++inner)
            {
                // move the "new Variable()" allocation to here to allocate more often,
                // making the ~50x slowdown more likely to show up
                var stopwatch = Stopwatch.StartNew();
                var total = 0.0;
                for (int i = 1; i <= COUNT; ++i)
                {
                    x.Value = i;
                    total += x.Value;
                }
                if (Math.Abs(total - 50000005000000.0) > 1)
                    throw new ApplicationException(total.ToString());
                Console.Write("{0}, ", stopwatch.ElapsedMilliseconds);
            }
            Console.WriteLine();
        }
    }
}
I see lots of web pages about alignment of structs for interop, so what about alignment of classes?
(Or are my assumptions wrong, and there is another issue with the above?)
Thanks,
Paul.

An interesting look at the gears that run the machine. I have a bit of a problem explaining why there are multiple distinct values (I got 4) when a double can only be aligned two ways. I think alignment to the CPU cache line plays a role as well, although that only adds up to 3 possible timings.
Well, there's nothing you can do about it; the CLR only promises alignment for 4-byte values, enough to guarantee atomic updates on 32-bit machines. This is not just an issue with managed code, C/C++ has this problem too. Looks like the chip makers need to solve this one.
If it is critical then you could allocate unmanaged memory with Marshal.AllocCoTaskMem() and use an unsafe pointer that you can align just right. Same kind of thing you'd have to do if you allocate memory for code that uses SIMD instructions; they require a 16-byte alignment. Consider it a desperation move though.
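For what it's worth, a minimal sketch of that approach (my own illustration, not code from the answer above): over-allocate with Marshal.AllocCoTaskMem, round the pointer up to the next 8-byte boundary, and keep the raw pointer around so you can free it later. The helper names are made up, it needs /unsafe, and error handling is omitted.
using System;
using System.Runtime.InteropServices;

static class AlignedUnmanaged
{
    // Allocates room for one double plus slack, then rounds the pointer
    // up to the next 8-byte boundary. Keep "raw" so the block can be freed later.
    public static unsafe double* AllocAlignedDouble(out IntPtr raw)
    {
        raw = Marshal.AllocCoTaskMem(sizeof(double) + 7);
        long aligned = (raw.ToInt64() + 7) & ~7L;
        return (double*)aligned;
    }

    public static void Free(IntPtr raw)
    {
        Marshal.FreeCoTaskMem(raw);
    }
}
Usage would be something like: get the double*, run the hot loop against *p instead of a field on a heap object, then call Free with the raw pointer when done.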

To prove the concept of misaligned objects on the heap in .NET, you can run the following code and you'll see that it now always runs fast. Please don't shoot me, it's just a PoC, but if you are really concerned about performance you might consider using it ;)
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Runtime.InteropServices;

public static class AlignedNew
{
    public static T New<T>() where T : new()
    {
        LinkedList<T> candidates = new LinkedList<T>();
        IntPtr pointer = IntPtr.Zero;
        bool continue_ = true;
        int size = Marshal.SizeOf(typeof(T)) % 8;
        while (continue_)
        {
            if (size == 0)
            {
                // throwaway allocation to nudge the next candidate to a different address
                object gap = new object();
            }
            candidates.AddLast(new T());
            GCHandle handle = GCHandle.Alloc(candidates.Last.Value, GCHandleType.Pinned);
            pointer = handle.AddrOfPinnedObject();
            continue_ = (pointer.ToInt64() % 8) != 0 || (pointer.ToInt64() % 64) == 24;
            handle.Free();
            if (!continue_)
                return candidates.Last.Value;
        }
        return default(T);
    }
}

class Program
{
    [StructLayoutAttribute(LayoutKind.Sequential)]
    public class Variable
    {
        public double Value;
    }

    static void Main()
    {
        const int COUNT = 10000000;
        while (true)
        {
            var x = AlignedNew.New<Variable>();
            for (int inner = 0; inner < 5; ++inner)
            {
                var stopwatch = Stopwatch.StartNew();
                var total = 0.0;
                for (int i = 1; i <= COUNT; ++i)
                {
                    x.Value = i;
                    total += x.Value;
                }
                if (Math.Abs(total - 50000005000000.0) > 1)
                    throw new ApplicationException(total.ToString());
                Console.Write("{0}, ", stopwatch.ElapsedMilliseconds);
            }
            Console.WriteLine();
        }
    }
}

Maybe the StructLayoutAttribute is what you are looking for?

Using a struct instead of a class makes the time constant. Also consider using StructLayoutAttribute; it lets you specify the exact memory layout of a struct. For classes, I do not think you have any guarantees about how they are laid out in memory.

It will be correctly aligned; otherwise you'd get alignment exceptions on x64. I don't know what your snippet shows, but I wouldn't conclude anything about alignment from it.

You don't have any control over how .NET lays out your class in memory.
As others have said, the StructLayoutAttribute can be used to force a specific memory layout for a struct, but note that its purpose is C/C++ interop, not fine-tuning the performance of your .NET app.
If you're worried about memory alignment issues then C# is probably the wrong choice of language.
EDIT - Broke out WinDbg and looked at the heap running the code above on 32-bit Vista and .NET 2.0.
Note: I don't get the variation in timings shown above.
0:003> !dumpheap -type Sample+Variable
Address MT Size
01dc2fec 003f3c48 16
01dc54a4 003f3c48 16
01dc58b0 003f3c48 16
01dc5cbc 003f3c48 16
01dc60c8 003f3c48 16
01dc64d4 003f3c48 16
01dc68e0 003f3c48 16
01dc6cd8 003f3c48 16
01dc70e4 003f3c48 16
01dc74f0 003f3c48 16
01dc78e4 003f3c48 16
01dc7cf0 003f3c48 16
01dc80fc 003f3c48 16
01dc8508 003f3c48 16
01dc8914 003f3c48 16
01dc8d20 003f3c48 16
01dc912c 003f3c48 16
01dc9538 003f3c48 16
total 18 objects
Statistics:
MT Count TotalSize Class Name
003f3c48 18 288 TestConsoleApplication.Sample+Variable
Total 18 objects
0:003> !do 01dc9538
Name: TestConsoleApplication.Sample+Variable
MethodTable: 003f3c48
EEClass: 003f15d0
Size: 16(0x10) bytes
(D:\testcode\TestConsoleApplication\bin\Debug\TestConsoleApplication.exe)
Fields:
MT Field Offset Type VT Attr Value Name
6f5746e4 4000001 4 System.Double 1 instance 1655149.000000 Value
It seems to me that the class instances' allocation addresses are aligned, unless I'm reading this wrong?

Related

Dealing with very large Lists on x86

I need to work with large lists of floats, but I am hitting memory limits on x86 systems. I do not know the final length, so I need to use an expandable type. On x64 systems, I can use <gcAllowVeryLargeObjects>.
My current data type:
List<RawData> param1 = new List<RawData>();
List<RawData> param2 = new List<RawData>();
List<RawData> param3 = new List<RawData>();
public class RawData
{
public string name;
public List<float> data;
}
The length of the paramN lists is low (currently 50 or lower), but data can be 10m+. When the length is 50, I am hitting memory limits (OutOfMemoryException) at just above 1m data points, and when the length is 25, I hit the limit at just above 2m data points. (If my calculations are right, that is exactly 200MB, plus the size of name, plus overhead). What can I use to increase this limit?
Edit: I tried using List<List<float>> with a max inner list size of 1 << 17 (131072), which increased the limit somewhat, but still not as far as I want.
Edit2: I tried reducing the chunk size in the List<List<float>> to 8192, and I got OOM at ~2.3m elements, with task manager reading ~1.4GB for the process. It looks like I need to reduce memory usage between the data source and the storage, or trigger GC more often. I was able to gather 10m data points in an x64 process on a PC with 4GB RAM; IIRC the process never went over 3GB.
Edit3: I condensed my code down to just the parts that handle the data. http://pastebin.com/maYckk84
Edit4: I had a look in DotMemory, and found that my data structure does take up ~1GB with the settings I was testing on (50ch * 3 params * 2m events = 300,000,000 float elements). I guess I will need to limit it on x86 or figure out how to write to disk in this format as I get data
First of all, on x86 systems the memory limit is 2GB, not 200MB. I presume your problem is trickier than that: you have aggressive LOH (large object heap) fragmentation.
The CLR uses different heaps for small and large objects. An object is large if it is bigger than 85,000 bytes. The LOH is a very finicky thing: it is not eager to return unused memory to the OS, and it is very poor at defragmentation.
.NET's List<T> is an array-backed list. It stores elements in a fixed-size array; when the array fills up, a new array with double the capacity is allocated and the elements are copied over. That continuous growth of the backing array with your amount of data is a "starvation" scenario for the LOH.
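To see where that doubling starts to hurt, here is a quick throwaway demo (my own illustration, not part of the answer): every time Capacity changes, List<float> has allocated a brand new backing array and abandoned the old one, and once an array passes roughly 85,000 bytes each of those arrays lands on the LOH.
using System;
using System.Collections.Generic;

class ListGrowthDemo
{
    static void Main()
    {
        var list = new List<float>();
        int lastCapacity = 0;
        for (int i = 0; i < 3000000; i++)
        {
            list.Add(0f);
            if (list.Capacity != lastCapacity)
            {
                lastCapacity = list.Capacity;
                long bytes = (long)lastCapacity * sizeof(float);
                Console.WriteLine("capacity {0,10}  backing array ~{1,12} bytes{2}",
                    lastCapacity, bytes, bytes >= 85000 ? "   <-- large object heap" : "");
            }
        }
    }
}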
So you have to use a tailor-made data structure to suit your needs, e.g. a list of chunks, with each chunk small enough not to end up on the LOH. Here is a small prototype:
using System;
using System.Collections.Generic;

public class ChunkedList
{
    private readonly List<float[]> _chunks = new List<float[]>();
    private const int ChunkSize = 8000;
    private int _count = 0;

    public void Add(float item)
    {
        int chunk = _count / ChunkSize;
        int ind = _count % ChunkSize;
        if (ind == 0)
        {
            _chunks.Add(new float[ChunkSize]);
        }
        _chunks[chunk][ind] = item;
        _count++;
    }

    public float this[int index]
    {
        get
        {
            if (index < 0 || index >= _count) throw new IndexOutOfRangeException();
            int chunk = index / ChunkSize;
            int ind = index % ChunkSize;
            return _chunks[chunk][ind];
        }
        set
        {
            if (index < 0 || index >= _count) throw new IndexOutOfRangeException();
            int chunk = index / ChunkSize;
            int ind = index % ChunkSize;
            _chunks[chunk][ind] = value;
        }
    }

    // other code you require
}
With ChunkSize = 8000, every chunk takes only 32,000 bytes, so it will not end up on the LOH. _chunks itself will only reach the LOH once there are roughly 16,000 chunks in the collection, which is more than 128 million elements in the collection (about 500 MB).
UPD: I've performed some stress tests on the sample above. The OS is x64, the solution platform is x86, and ChunkSize is 20000.
First:
var list = new ChunkedList();
for (int i = 0; ; i++)
{
    list.Add(0.1f);
}
OutOfMemoryException is raised at ~324,000,000 elements
Second:
public class RawData
{
    public string Name;
    public ChunkedList Data = new ChunkedList();
}

var list = new List<RawData>();
for (int i = 0; ; i++)
{
    var raw = new RawData { Name = "Test" + i };
    for (int j = 0; j < 20 * 1000 * 1000; j++)
    {
        raw.Data.Add(0.1f);
    }
    list.Add(raw);
}
OutOfMemoryException is raised at i=17, j~12,000,000: 17 RawData instances were successfully created with 20 million data points each, about 352 million data points in total.

Defining a bit[] array in C#

Currently I'm working on a solution for a prime-number calculator/checker. The algorithm is already working and very efficient (0.359 seconds for the first 9012330 primes). Here is the part of the upper region where everything is declared:
const uint anz = 50000000;
uint a = 3, b = 4, c = 3, d = 13, e = 12, f = 13, g = 28, h = 32;
bool[,] prim = new bool[8, anz / 10];
uint max = 3 * (uint)(anz / (Math.Log(anz) - 1.08366));
uint[] p = new uint[max];
Now I want to go to the next level and use ulongs instead of uints to cover a larger range (you can see that already), which is where I ran into my problem: the bool array.
As everybody should know, a bool takes up a whole byte, which wastes a lot of memory when creating the array... so I'm searching for a more resource-friendly way to do that.
My first idea was a bit array (not a byte array!) to store the flags, but I haven't figured out how to do that yet. So if someone has ever done something like this, I would appreciate any tips or solutions. Thanks in advance :)
You can use BitArray collection:
http://msdn.microsoft.com/en-us/library/system.collections.bitarray(v=vs.110).aspx
MSDN Description:
Manages a compact array of bit values, which are represented as Booleans, where true indicates that the bit is on (1) and false indicates the bit is off (0).
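For instance, a minimal sketch of a sieve built on the framework's BitArray (my example, just to show the API; the sieve itself is the plain Sieve of Eratosthenes, not your algorithm):
using System;
using System.Collections;

class BitArraySieve
{
    static void Main()
    {
        const int limit = 50000000;
        var composite = new BitArray(limit + 1);   // one bit per number, all false initially
        int count = 0;
        for (int n = 2; n <= limit; n++)
        {
            if (composite[n]) continue;
            count++;                               // n is prime
            for (long m = (long)n * n; m <= limit; m += n)
                composite[(int)m] = true;          // mark multiples as composite
        }
        Console.WriteLine("{0} primes up to {1}", count, limit);
    }
}
At 50,000,000 numbers the flags take roughly 6 MB instead of 50 MB for a bool[].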
You can (and should) use well tested and well known libraries.
But if you're looking to learn something (as it seems to be the case) you can do it yourself.
Another reason you may want to use a custom bit array is to use the hard drive to store the array, which comes in handy when calculating primes. To do this you'd need to further split addr, for example lowest 3 bits for the mask, next 28 bits for 256MB of in-memory storage, and from there on - a file name for a buffer file.
Yet another reason for custom bit array is to compress the memory use when specifically searching for primes. After all more than half of your bits will be 'false' because the numbers corresponding to them would be even, so in fact you can both speed up your calculation AND reduce memory requirements if you don't even store the even bits. You can do that by changing the way addr is interpreted. Further more you can also exclude numbers divisible by 3 (only 2 out of every 6 numbers has a chance of being prime) thus reducing memory requirements by 60% compared to plain bit array.
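As a rough sketch of that "don't store the even bits" idea (my own illustration, not code from this answer): give a bit only to odd candidates, so an odd number n >= 3 maps to bit (n - 3) / 2 and the caller treats 2 as a special case.
// Illustration only: a bit buffer that stores flags for odd numbers >= 3.
public class OddOnlyBitArray
{
    private readonly byte[] _buffer;

    public OddOnlyBitArray(long maxNumber)
    {
        long bits = (maxNumber - 1) / 2;      // number of odd candidates in [3, maxNumber]
        _buffer = new byte[bits / 8 + 1];
    }

    public bool this[long oddNumber]          // caller handles 2 and even numbers itself
    {
        get
        {
            long bit = (oddNumber - 3) >> 1;
            return (_buffer[bit >> 3] & (1 << (int)(bit & 7))) != 0;
        }
        set
        {
            long bit = (oddNumber - 3) >> 1;
            byte mask = (byte)(1 << (int)(bit & 7));
            if (value) _buffer[bit >> 3] |= mask;
            else _buffer[bit >> 3] &= (byte)~mask;
        }
    }
}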
Notice the use of shift and logical operators to make the code a bit more efficient.
byte mask = (byte)(1 << (int)(addr & 7)); for example can be written as
byte mask = (byte)(1 << (int)(addr % 8));
and addr >> 3 can be written as addr / 8
Testing shift/logical operators vs division shows 2.6s vs 4.8s in favor of shift/logical for 200000000 operations.
Here's the code:
void Main()
{
    var barr = new BitArray(10);
    barr[4] = true;
    Console.WriteLine("Is it " + barr[4]);
    Console.WriteLine("Is it Not " + barr[5]);
}

public class BitArray
{
    private readonly byte[] _buffer;

    public bool this[long addr]
    {
        get
        {
            byte mask = (byte)(1 << (int)(addr & 7));
            byte val = _buffer[(int)(addr >> 3)];
            bool bit = (val & mask) == mask;
            return bit;
        }
        set
        {
            byte mask = (byte)(1 << (int)(addr & 7));
            int offs = (int)(addr >> 3);
            if (value)
                _buffer[offs] = (byte)(_buffer[offs] | mask);   // set the bit
            else
                _buffer[offs] = (byte)(_buffer[offs] & ~mask);  // clear the bit (the original setter could only set, never clear)
        }
    }

    public BitArray(long size)
    {
        _buffer = new byte[size / 8 + 1]; // a byte buffer sized to hold 8 bools per byte; the spare +1 avoids dealing with rounding
    }
}

C to C# Bytearray + hex

I'm currently trying to get this C code converted into C#.
Since I'm not really familiar with C, I'd really appreciate your help!
static unsigned char byte_table[2080] = {0};
First off, a byte array gets declared but never filled, which I'm okay with.
BYTE* packet = //bytes come in here from a file
int unknownVal = 0;
int unknown_field0 = *(DWORD *)(packet + 0x08);
do
{
    *((BYTE *)packet + i) ^= byte_table[(i + unknownVal) & 0x7FF];
    ++i;
}
while (i <= packet[0]);
But down here.. I really have no idea how to translate this into C#
BYTE = byte[] right?
DWORD = double?
but how can (packet + 0x08) be translated? How can I add a hex to a bytearray? Oo
I'd be happy about anything that helps! :)
In C, setting any set of memory to {0} will set the entire memory area to zeroes, if I'm not mistaken.
That bottom loop can be rewritten in a simpler, C# friendly fashion.
byte[] byte_table = new byte[2080];   // assumed to be filled elsewhere, as in the C code
byte[] packet = arrayofcharsfromfile;
int field = packet[8] + (packet[9] << 8) + (packet[10] << 16) + (packet[11] << 24); // assuming a 32-bit little-endian integer
int unknownval = 0;
int i = 0;
do // Why waste the newline? I don't know. Conventions are silly!
{
    packet[i] ^= byte_table[(i + unknownval) & 0x7FF];
} while (++i <= packet[0]);
field is set by taking the four bytes including and following index 8 and generating a 32 bit int from them.
In C, you can cast pointers to other types, as is done in your provided snippet. What they're doing is taking an array of bytes (each one 1/4 the size of a DWORD) and adding 8 to the index which advances the pointer by 8 bytes (since each element is a byte wide) and then treating that pointer as a DWORD pointer. In simpler terms, they're turning the byte array in to a DWORD array, and then taking index 2, as 8/4=2.
You can simulate this behavior in a safe fashion by stringing the bytes together with bitshifting and addition, as I demonstrated above. It's not as efficient and isn't as pretty, but it accomplishes the same thing, and in a platform agnostic way too. Not all platforms are little endian.
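As an aside (my addition, not part of the answer above), if you'd rather not string the shifts together by hand, BitConverter does the same job for you on a little-endian machine:
// Equivalent to the manual shifting above on a little-endian platform:
int field = BitConverter.ToInt32(packet, 8);

// If you want to be explicit about endianness, you can check at runtime:
if (!BitConverter.IsLittleEndian)
{
    byte[] tmp = { packet[11], packet[10], packet[9], packet[8] };
    field = BitConverter.ToInt32(tmp, 0);
}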

What is the fastest way I can compare two equal-size bitmaps to determine whether they are identical?

I am trying to write a function to determine whether two equal-size bitmaps are identical or not. The function I have right now simply compares a pixel at a time in each bitmap, returning false at the first non-equal pixel.
While this works, and works well for small bitmaps, in production I'm going to be using this in a tight loop and on larger images, so I need a better way. Does anyone have any recommendations?
The language I'm using is C# by the way - and yes, I am already using the .LockBits method. =)
Edit: I've coded up implementations of some of the suggestions given, and here are the benchmarks. The setup: two identical (worst-case) bitmaps, 100x100 in size, with 10,000 iterations each. Here are the results:
CompareByInts (Marc Gravell) : 1107ms
CompareByMD5 (Skilldrick) : 4222ms
CompareByMask (GrayWizardX) : 949ms
In CompareByInts and CompareByMask I'm using pointers to access the memory directly; in the MD5 method I'm using Marshal.Copy to retrieve a byte array and pass that as an argument to MD5.ComputeHash. CompareByMask is only slightly faster, but given the context I think any improvement is useful.
Thanks everyone. =)
Edit 2: Forgot to turn optimizations on - doing that gives GrayWizardX's answer even more of a boost:
CompareByInts (Marc Gravell) : 944ms
CompareByMD5 (Skilldrick) : 4275ms
CompareByMask (GrayWizardX) : 630ms
CompareByMemCmp (Erik) : 105ms
Interesting that the MD5 method didn't improve at all.
Edit 3: Posted my answer (MemCmp) which blew the other methods out of the water. o.O
Edit 8-31-12: per Joey's comment below, be mindful of the format of the bitmaps you compare. They may contain padding on the strides that render the bitmaps unequal, despite being equivalent pixel-wise. See this question for more details.
Reading this answer to a question regarding comparing byte arrays has yielded a MUCH FASTER method: using P/Invoke and the memcmp API call in msvcrt. Here's the code:
[DllImport("msvcrt.dll")]
private static extern int memcmp(IntPtr b1, IntPtr b2, long count);

public static bool CompareMemCmp(Bitmap b1, Bitmap b2)
{
    if ((b1 == null) != (b2 == null)) return false;
    if (b1.Size != b2.Size) return false;

    var bd1 = b1.LockBits(new Rectangle(new Point(0, 0), b1.Size), ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);
    var bd2 = b2.LockBits(new Rectangle(new Point(0, 0), b2.Size), ImageLockMode.ReadOnly, PixelFormat.Format32bppArgb);

    try
    {
        IntPtr bd1scan0 = bd1.Scan0;
        IntPtr bd2scan0 = bd2.Scan0;

        int stride = bd1.Stride;
        int len = stride * b1.Height;

        return memcmp(bd1scan0, bd2scan0, len) == 0;
    }
    finally
    {
        b1.UnlockBits(bd1);
        b2.UnlockBits(bd2);
    }
}
If you are trying to determine whether they are 100% equal, you can invert one and add it to the other; if the result is zero, they are identical. Extending this with unsafe code, take 64 bits at a time as a long and do the math that way; any difference causes an immediate fail.
If the images are not 100% identical (comparing png to jpeg), or if you are not looking for a 100% match then you have some more work ahead of you.
Good luck.
Well, you're using .LockBits, so presumably you're using unsafe code. Rather than treating each row origin (Scan0 + y * Stride) as a byte*, consider treating it as an int*; int arithmetic is pretty quick, and you only have to do 1/4 as much work. And for images in ARGB you might still be talking in pixels, making the math simple.
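A rough sketch of that row-by-row int* idea (my own illustration, not Marc's code; it assumes both bitmaps are already locked as 32bpp ARGB and compares only the pixel data, ignoring any stride padding):
// Requires System.Drawing.Imaging and compiling with /unsafe.
private static unsafe bool RowsEqual(BitmapData bd1, BitmapData bd2, int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        int* row1 = (int*)((byte*)bd1.Scan0 + y * bd1.Stride);
        int* row2 = (int*)((byte*)bd2.Scan0 + y * bd2.Stride);
        for (int x = 0; x < width; x++)   // one int per ARGB pixel, so 1/4 the iterations of a byte* loop
        {
            if (row1[x] != row2[x])
                return false;
        }
    }
    return true;
}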
Could you take a hash of each and compare? It would be slightly probabilistic, but practically not.
Thanks to Ram, here's a sample implementation of this technique.
If the original problem is just to find the exact duplicates among two bitmaps, then just a bit level comparison will have to do. I don't know C# but in C I would use the following function:
int areEqual (long size, long *a, long *b)
{
long start = size / 2;
long i;
for (i = start; i != size; i++) { if (a[i] != b[i]) return 0 }
for (i = 0; i != start; i++) { if (a[i] != b[i]) return 0 }
return 1;
}
I would start looking in the middle because I suspect there is a much better chance of finding unequal bits near the middle of the image than the beginning; of course, this would really depend on the images you are deduping, selecting a random place to start may be best.
If you are trying to find the exact duplicates among hundreds of images then comparing all pairs of them is unnecessary. First compute the MD5 hash of each image and place it in a list of pairs (md5Hash, imageId); then sort the list by the m5Hash. Next, only do pairwise comparisons on the images that have the same md5Hash.
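For the "hash first, compare only the collisions" step, a minimal sketch (my own, with made-up helper names; it assumes you can get at the raw bytes of each image somehow):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

static class DuplicateFinder
{
    // imageBytes: imageId -> raw image bytes, however you obtain them
    public static List<IGrouping<string, int>> GroupLikelyDuplicates(IDictionary<int, byte[]> imageBytes)
    {
        using (var md5 = MD5.Create())
        {
            return imageBytes
                .Select(kv => new { Id = kv.Key, Hash = Convert.ToBase64String(md5.ComputeHash(kv.Value)) })
                .GroupBy(x => x.Hash, x => x.Id)
                .Where(g => g.Count() > 1)    // only these groups need a pixel-exact comparison
                .ToList();
        }
    }
}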
If these bitmaps are already on your graphics card then you can parallelize such a check by doing it on the graphics card using a language like CUDA or OpenCL.
I'll explain in terms of CUDA, since that's the one I know. Basically CUDA lets you write general purpose code to run in parallel across each node of your graphics card. You can access bitmaps that are in shared memory. Each invocation of the function is also given an index within the set of parallel runs. So, for a problem like this, you'd just run one of the above comparison functions for some subset of the bitmap - using parallelization to cover the entire bitmap. Then, just write a 1 to a certain memory location if the comparison fails (and write nothing if it succeeds).
If you don't already have the bitmaps on your graphics card, this probably isn't the way to go, since the costs for loading the two bitmaps on your card will easily eclipse the savings such parallelization will gain you.
Here's some (pretty bad) example code (it's been a little while since I programmed CUDA). There are better ways to access bitmaps that are already loaded as textures, but I didn't bother here.
// kernel to run on GPU, once per thread
__global__ void compare_bitmaps(long const * const A, long const * const B, char * const retValue, size_t const len)
{
    // divide the work equally among the threads (each thread is in a block, each block is in a grid)
    size_t const threads_per_block = blockDim.x * blockDim.y * blockDim.z;
    size_t const len_to_compare = len / (gridDim.x * gridDim.y * gridDim.z * threads_per_block);

#define offset3(idx3,dim3) (idx3.x + dim3.x * (idx3.y + dim3.y * idx3.z))
    size_t const start_offset = len_to_compare * (offset3(threadIdx,blockDim) + threads_per_block * offset3(blockIdx,gridDim));
    size_t const stop_offset = start_offset + len_to_compare;
#undef offset3

    size_t i;
    for (i = start_offset; i < stop_offset; i++)
    {
        if (A[i] != B[i])
        {
            *retValue = 1;
            break;
        }
    }
    return;
}
If you can implement something like Duff's Device in your language, that might give you a significant speed boost over a simple loop. Usually it's used for copying data, but there's no reason it can't be used for comparing data instead.
Or, for that matter, you may just want to use some equivalent to memcmp().
You could try to add them to a database "blob" then use the database engine to compare their binaries. This would only give you a yes or no answer to whether the binary data is the same. It would be very easy to make 2 images that produce the same graphic but have different binary though.
You could also select a few random pixels and compare them, then if they are the same continue with more until you've checked all the pixels. This would only return a faster negative match though, it still would take as long to find 100% positive matches
Based on the approach of comparing hashes instead of comparing every single pixel, this is what I use:
using System.Drawing;
using System.Linq;
using System.Security.Cryptography;

public static class Utils
{
    public static byte[] ShaHash(this Image image)
    {
        var bytes = new byte[1];
        bytes = (byte[])(new ImageConverter()).ConvertTo(image, bytes.GetType());
        return (new SHA256Managed()).ComputeHash(bytes);
    }

    public static bool AreEqual(Image imageA, Image imageB)
    {
        if (imageA.Width != imageB.Width) return false;
        if (imageA.Height != imageB.Height) return false;

        var hashA = imageA.ShaHash();
        var hashB = imageB.ShaHash();

        return !hashA
            .Where((nextByte, index) => nextByte != hashB[index])
            .Any();
    }
}
Usage is straight forward:
bool isMatch = Utils.AreEqual(bitmapOne, bitmapTwo);

C data structure to mimic C#'s List<List<int>>?

I am looking to refactor a C# method into a C function in an attempt to gain some speed, and then call the C DLL from C# to allow my program to use that functionality.
Currently the C# method takes a list of integers and returns a list of lists of integers. The method calculates the power set of the integers, so an input of 3 ints would produce the following output (at this stage the values of the ints are not important, as they are used as internal weighting values):
1
2
3
1,2
1,3
2,3
1,2,3
Each line represents a list of integers. The output indicates indices (offset by 1) into the input list, not values; so 1,2 indicates that the elements at index 0 and 1 form one element of the power set.
I am unfamiliar with C, so what are my best options for data structures that will allow the C# code to access the returned data?
Thanks in advance
Update
Thank you all for your comments so far. Here is a bit of a background to the nature of the problem.
The iterative method for calculating the power set of a set is fairly straightforward. Two loops and a bit of bit manipulation is all there is to it, really. It just gets called... a lot (in fact billions of times, if the size of the set is big enough).
My thoughts about using C (or C++, as people have pointed out) are that it gives more scope for performance tuning. A direct port may not offer any increase, but it opens the way for more involved methods to get a bit more speed out of it. Even a small increase per iteration would equate to a measurable overall gain.
My idea was to port a direct version and then work to increase it. And then refactor it over time (with help from everyone here at SO).
Update 2
Another fair point from jalf: I don't have to use List or an equivalent. If there is a better way then I am open to suggestions. The only reason for the list was that each set of results is not the same size.
The code so far...
public List<List<int>> powerset(List<int> currentGroupList)
{
    _currentGroupList = currentGroupList;
    var outputList = new List<List<int>>();
    int max;
    int count;

    //Count the objects in the group
    count = _currentGroupList.Count;
    max = (int)Math.Pow(2, count);

    //outer loop
    for (int i = 0; i < max; i++)
    {
        _currentSetList = new List<int>();

        //inner loop
        for (int j = 0; j < count; j++)
        {
            if ((i & (1 << j)) == 0)
            {
                _currentSetList.Add(_currentGroupList.ElementAt(j));
            }
        }
        outputList.Add(_currentSetList);
    }
    return outputList;
}
As you can see, not a lot to it. It just goes round and round a lot!
I accept that the creating and building of lists may not be the most efficient way, but I need some way of providing the results back in a manageable way.
Update 2
Thanks for all the input and implementation work. Just to clarify a couple of points raised: I don't need the output to be in 'natural order', and I am not that interested in the empty set being returned.
hughdbrown's implementation is interesting, but I think that I will need to store the results (or at least a subset of them) at some point. It sounds like memory limitations will apply long before running time becomes a real issue.
Partly because of this, I think I can get away with using bytes instead of integers, giving more potential storage.
The question really is then: have we reached the maximum speed for this calculation in C#? Does the option of unmanaged code provide any more scope? I know in many respects the answer is futile, as even if we halved the running time, it would only allow an extra value or two in the original set.
Also, be sure that moving to C/C++ is really what you need to do for speed to begin with. Instrument the original C# method (standalone, executed via unit tests), instrument the new C/C++ method (again, standalone via unit tests) and see what the real world difference is.
The reason I bring this up is that I fear it may be a pyrhhic victory -- using Smokey Bacon's advice, you get your list class, you're in "faster" C++, but there's still a cost to calling that DLL: Bouncing out of the runtime with P/Invoke or COM interop carries a fairly substantial performance cost.
Be sure you're getting your "money's worth" out of that jump before you do it.
Update based on the OP's Update
If you're calling this loop repeatedly, you need to absolutely make sure that the entire loop logic is encapsulated in a single interop call -- otherwise the overhead of marshalling (as others here have mentioned) will definitely kill you.
I do think, given the description of the problem, that the issue isn't that C#/.NET is "slower" than C, but more likely that the code needs to be optimized. As another poster here mentioned, you can use pointers in C# to seriously boost performance in this kind of loop, without the need for marshalling. I'd look into that first, before jumping into a complex interop world, for this scenario.
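To make the earlier point about a single interop call concrete, here is a hypothetical shape for such an API (the native DLL, export name and layout are entirely made up for illustration): the managed side hands the whole input over once and receives the whole result back once, as a flat array plus an offsets table, so nothing is marshalled per subset.
using System;
using System.Runtime.InteropServices;

static class NativePowerSet
{
    // Hypothetical native export: fills a caller-allocated flat buffer with all subsets
    // back to back, and an offsets table marking where each subset starts.
    [DllImport("powerset.dll", CallingConvention = CallingConvention.Cdecl)]
    private static extern int ComputePowerSet(
        int[] input, int inputLength,
        int[] flatOutput, int flatOutputLength,
        int[] subsetOffsets, int subsetOffsetsLength);

    public static int[][] Compute(int[] input)
    {
        int subsetCount = 1 << input.Length;
        var flat = new int[input.Length * (subsetCount / 2)];  // total elements = n * 2^(n-1)
        var offsets = new int[subsetCount + 1];

        if (ComputePowerSet(input, input.Length, flat, flat.Length, offsets, offsets.Length) != 0)
            throw new InvalidOperationException("native call failed");

        // Re-slice the flat buffer into one array per subset on the managed side.
        var result = new int[subsetCount][];
        for (int i = 0; i < subsetCount; i++)
        {
            int len = offsets[i + 1] - offsets[i];
            result[i] = new int[len];
            Array.Copy(flat, offsets[i], result[i], 0, len);
        }
        return result;
    }
}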
If you are looking to use C for a performance gain, most likely you are planning to do so through the use of pointers. C# does allow for use of pointers, using the unsafe keyword. Have you considered that?
Also, how will you be calling this code? Will it be called often (e.g. in a loop)? If so, marshalling the data back and forth may more than offset any performance gains.
Follow Up
Take a look at Native code without sacrificing .NET performance for some interop options. There are ways to interop without too much of a performance loss, but those interops can only happen with the simplest of data types.
Though I still think that you should investigate speeding up your code using straight .NET.
Follow Up 2
Also, may I suggest that if you have your heart set on mixing native code and managed code, you create your library using C++/CLI. Below is a simple example. Note that I am not a C++/CLI guy, and this code doesn't do anything useful... it's just meant to show how easily you can mix native and managed code.
#include "stdafx.h"

using namespace System;

System::Collections::Generic::List<int> ^MyAlgorithm(System::Collections::Generic::List<int> ^sourceList);

int main(array<System::String ^> ^args)
{
    System::Collections::Generic::List<int> ^intList = gcnew System::Collections::Generic::List<int>();
    intList->Add(1);
    intList->Add(2);
    intList->Add(3);
    intList->Add(4);
    intList->Add(5);

    Console::WriteLine("Before Call");
    for each(int i in intList)
    {
        Console::WriteLine(i);
    }

    System::Collections::Generic::List<int> ^modifiedList = MyAlgorithm(intList);

    Console::WriteLine("After Call");
    for each(int i in modifiedList)
    {
        Console::WriteLine(i);
    }
}

System::Collections::Generic::List<int> ^MyAlgorithm(System::Collections::Generic::List<int> ^sourceList)
{
    int* nativeInts = new int[sourceList->Count];
    int nativeIntArraySize = sourceList->Count;

    //Managed to Native
    for(int i=0; i<sourceList->Count; i++)
    {
        nativeInts[i] = sourceList[i];
    }

    //Do Something to native ints
    for(int i=0; i<nativeIntArraySize; i++)
    {
        nativeInts[i]++;
    }

    //Native to Managed
    System::Collections::Generic::List<int> ^returnList = gcnew System::Collections::Generic::List<int>();
    for(int i=0; i<nativeIntArraySize; i++)
    {
        returnList->Add(nativeInts[i]);
    }

    delete[] nativeInts; // free the native buffer (the original example leaked it)
    return returnList;
}
What makes you think you'll gain speed by calling into C code? C isn't magically faster than C#. It can be, of course, but it can also easily be slower (and buggier). Especially when you factor in the p/invoke calls into native code, it's far from certain that this approach will speed up anything.
In any case, C doesn't have anything like List. It has raw arrays and pointers (and you could argue that int** is more or less equivalent), but you're probably better off using C++, which does have equivalent data structures, in particular std::vector.
There are no simple ways to expose this data to C#, however, since it will be scattered pretty much randomly (each list is a pointer to some dynamically allocated memory somewhere).
However, I suspect the biggest performance improvement comes from improving the algorithm in C#.
Edit:
I can see several things in your algorithm that seem suboptimal. Constructing a list of lists isn't free. Perhaps you can create a single list and use different offsets to represent each sublist. Or perhaps using 'yield return' and IEnumerable instead of explicitly constructing lists might be faster.
Have you profiled your code, found out where the time is being spent?
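A quick sketch of that "single list plus offsets" idea (my illustration, not jalf's code): every subset lives back to back in one flat list, and a second list records where each subset starts.
// One flat list holds every subset back to back; offsets[k] is where subset k starts,
// and offsets[k + 1] is where it ends (one sentinel entry at the end).
static void FlatPowerSet(List<int> input, out List<int> flat, out List<int> offsets)
{
    int count = input.Count;
    int max = 1 << count;
    flat = new List<int>();
    offsets = new List<int>();
    for (int i = 0; i < max; i++)
    {
        offsets.Add(flat.Count);
        for (int j = 0; j < count; j++)
        {
            if ((i & (1 << j)) != 0)
                flat.Add(input[j]);
        }
    }
    offsets.Add(flat.Count);
}
Subset k is then the range flat[offsets[k]] .. flat[offsets[k + 1] - 1], which also happens to be a layout that marshals to native code as two plain arrays.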
This returns one set of a powerset at a time. It is based on python code here. It works for powersets of over 32 elements. If you need fewer than 32, you can change long to int. It is pretty fast -- faster than my previous algorithm and faster than (my modified to use yield return version of) P Daddy's code.
using System.Collections.Generic;
using System.Linq;

static class PowerSet4<T>
{
    static public IEnumerable<IList<T>> powerset(T[] currentGroupList)
    {
        int count = currentGroupList.Length;
        Dictionary<long, T> powerToIndex = new Dictionary<long, T>();
        long mask = 1L;
        for (int i = 0; i < count; i++)
        {
            powerToIndex[mask] = currentGroupList[i];
            mask <<= 1;
        }

        Dictionary<long, T> result = new Dictionary<long, T>();
        yield return result.Values.ToArray();

        long max = 1L << count;
        for (long i = 1L; i < max; i++)
        {
            long key = i & -i;
            if (result.ContainsKey(key))
                result.Remove(key);
            else
                result[key] = powerToIndex[key];
            yield return result.Values.ToArray();
        }
    }
}
You can download all the fastest versions I have tested here.
I really think that using yield return is the change that makes calculating large powersets possible. Allocating large amounts of memory upfront increases runtime dramatically and causes algorithms to fail for lack of memory very early on. Original Poster should figure out how many sets of a powerset he needs at once. Holding all of them is not really an option with >24 elements.
I'm also going to put in a vote for tuning-up your C#, particularly by going to 'unsafe' code and losing what might be a lot of bounds-checking overhead.
Even though it's 'unsafe', it's no less 'safe' than C/C++, and it's dramatically easier to get right.
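For instance, here is a trivial sketch of the kind of change meant (my example, not the answerer's): pin the array and walk it with a raw pointer so there is no per-element bounds check, at the cost of compiling with /unsafe.
// Safe version: each a[i] access is bounds-checked (unless the JIT can prove the check away).
static long SumSafe(int[] a)
{
    long total = 0;
    for (int i = 0; i < a.Length; i++)
        total += a[i];
    return total;
}

// Unsafe version: pin the array and use pointer arithmetic instead.
static unsafe long SumUnsafe(int[] a)
{
    long total = 0;
    fixed (int* p = a)
    {
        for (int i = 0; i < a.Length; i++)
            total += p[i];
    }
    return total;
}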
Below is a C# algorithm that should be much faster (and use less memory) than the algorithm you posted. It doesn't use the neat binary trick yours uses, and as a result, the code is a good bit longer. It has a few more for loops than yours, and might take a time or two stepping through it with the debugger to fully grok it. But it's actually a simpler approach, once you understand what it's doing.
As a bonus, the returned sets are in a more "natural" order. It would return subsets of the set {1 2 3} in the same order you listed them in your question. That wasn't a focus, but is a side effect of the algorithm used.
In my tests, I found this algorithm to be approximately 4 times faster than the algorithm you posted for a large set of 22 items (which was as large as I could go on my machine without excessive disk-thrashing skewing the results too much). One run of yours took about 15.5 seconds, and mine took about 3.6 seconds.
For smaller lists, the difference is less pronounced. For a set of only 10 items, yours ran 10,000 times in about 7.8 seconds, and mine took about 3.2 seconds. For sets with 5 or fewer items, they run close to the same time. With many iterations, yours runs a little faster.
Anyway, here's the code. Sorry it's so long; I tried to make sure I commented it well.
/*
* Made it static, because it shouldn't really use or modify state data.
* Making it static also saves a tiny bit of call time, because it doesn't
* have to receive an extra "this" pointer. Also, accessing a local
* parameter is a tiny bit faster than accessing a class member, because
* dereferencing the "this" pointer is not free.
*
* Made it generic so that the same code can handle sets of any type.
*/
static IList<IList<T>> PowerSet<T>(IList<T> set){
if(set == null)
throw new ArgumentNullException("set");
/*
* Caveat:
* If set.Count > 30, this function pukes all over itself without so
* much as wiping up afterwards. Even for 30 elements, though, the
* result set is about 68 GB (if "set" is comprised of ints). 24 or
* 25 elements is a practical limit for current hardware.
*/
int setSize = set.Count;
int subsetCount = 1 << setSize; // MUCH faster than (int)Math.Pow(2, setSize)
T[][] rtn = new T[subsetCount][];
/*
* We don't really need dynamic list allocation. We can calculate
* in advance the number of subsets ("subsetCount" above), and
* the size of each subset (0 through setSize). The performance
* of List<> is pretty horrible when the initial size is not
* guessed well.
*/
int subsetIndex = 0;
for(int subsetSize = 0; subsetSize <= setSize; subsetSize++){
/*
* The "indices" array below is part of how we implement the
* "natural" ordering of the subsets. For a subset of size 3,
* for example, we initialize the indices array with {0, 1, 2};
* Later, we'll increment each index until we reach setSize,
* then carry over to the next index. So, assuming a set size
* of 5, the second iteration will have indices {0, 1, 3}, the
* third will have {0, 1, 4}, and the fifth will involve a carry,
* so we'll have {0, 2, 3}.
*/
int[] indices = new int[subsetSize];
for(int i = 1; i < subsetSize; i++)
indices[i] = i;
/*
* Now we'll iterate over all the subsets we need to make for the
* current subset size. The number of subsets of a given size
* is easily determined with combination (nCr). In other words,
* if I have 5 items in my set and I want all subsets of size 3,
* I need 5-pick-3, or 5C3 = 5! / 3!(5 - 3)! = 10.
*/
for(int i = Combination(setSize, subsetSize); i > 0; i--){
/*
* Copy the items from the input set according to the
* indices we've already set up. Alternatively, if you
* just wanted the indices in your output, you could
* just dup the index array here (but make sure you dup!
* Otherwise the setup step at the bottom of this for
* loop will mess up your output list! You'll also want
* to change the function's return type to
* IList<IList<int>> in that case.
*/
T[] subset = new T[subsetSize];
for(int j = 0; j < subsetSize; j++)
subset[j] = set[indices[j]];
/* Add the subset to the return */
rtn[subsetIndex++] = subset;
/*
* Set up indices for next subset. This looks a lot
* messier than it is. It simply increments the
* right-most index until it overflows, then carries
* over left as far as it needs to. I've made the
* logic as fast as I could, which is why it's hairy-
* looking. Note that the inner for loop won't
* actually run as long as a carry isn't required,
* and will run at most once in any case. The outer
* loop will go through as few iterations as required.
*
* You may notice that this logic doesn't check the
* end case (when the left-most digit overflows). It
* doesn't need to, since the loop up above won't
* execute again in that case, anyway. There's no
* reason to waste time checking that here.
*/
for(int j = subsetSize - 1; j >= 0; j--)
if(++indices[j] <= setSize - subsetSize + j){
for(int k = j + 1; k < subsetSize; k++)
indices[k] = indices[k - 1] + 1;
break;
}
}
}
return rtn;
}
static int Combination(int n, int r){
if(r == 0 || r == n)
return 1;
/*
* The formula for combination is:
*
* n!
* ----------
* r!(n - r)!
*
* We'll actually use a slightly modified version here. The above
* formula forces us to calculate (n - r)! twice. Instead, we only
* multiply for the numerator the factors of n! that aren't canceled
* out by (n - r)! in the denominator.
*/
/*
* nCr == nC(n - r)
* We can use this fact to reduce the number of multiplications we
* perform, as well as the incidence of overflow, where r > n / 2
*/
if(r > n / 2) /* We DO want integer truncation here (7 / 2 = 3) */
r = n - r;
/*
* I originally used all integer math below, with some complicated
* logic and another function to handle cases where the intermediate
* results overflowed a 32-bit int. It was pretty ugly. In later
* testing, I found that the more generalized double-precision
* floating-point approach was actually *faster*, so there was no
* need for the ugly code. But if you want to see a giant WTF, look
* at the edit history for this post!
*/
double denominator = Factorial(r);
double numerator = n;
while(--r > 0)
numerator *= --n;
return (int)(numerator / denominator + 0.1/* Deal with rounding errors. */);
}
/*
* The archetypical factorial implementation is recursive, and is perhaps
* the most often used demonstration of recursion in text books and other
* materials. It's unfortunate, however, that few texts point out that
* it's nearly as simple to write an iterative factorial function that
* will perform better (although tail-end recursion, if implemented by
* the compiler, will help to close the gap).
*/
static double Factorial(int x){
/*
* An all-purpose factorial function would handle negative numbers
* correctly - the result should be Sign(x) * Factorial(Abs(x)) -
* but since we don't need that functionality, we're better off
* saving the few extra clock cycles it would take.
*/
/*
* I originally used all integer math below, but found that the
* double-precision floating-point version is not only more
* general, but also *faster*!
*/
if(x < 2)
return 1;
double rtn = x;
while(--x > 1)
rtn *= x;
return rtn;
}
Your list of results does not match the results your code would produce. In particular, you do not show generating the empty set.
If I were producing powersets that could have a few billion subsets, then generating each subset separately rather than all at once might cut down on your memory requirements, improving your code's speed. How about this:
using System.Collections.Generic;

static class PowerSet<T>
{
    static long[] mask = { 1L << 0, 1L << 1, 1L << 2, 1L << 3,
                           1L << 4, 1L << 5, 1L << 6, 1L << 7,
                           1L << 8, 1L << 9, 1L << 10, 1L << 11,
                           1L << 12, 1L << 13, 1L << 14, 1L << 15,
                           1L << 16, 1L << 17, 1L << 18, 1L << 19,
                           1L << 20, 1L << 21, 1L << 22, 1L << 23,
                           1L << 24, 1L << 25, 1L << 26, 1L << 27,
                           1L << 28, 1L << 29, 1L << 30, 1L << 31 };

    static public IEnumerable<IList<T>> powerset(T[] currentGroupList)
    {
        int count = currentGroupList.Length;
        long max = 1L << count;
        for (long iter = 0; iter < max; iter++)
        {
            T[] list = new T[count];
            int k = 0, m = -1;
            for (long i = iter; i != 0; i &= (i - 1))
            {
                while ((mask[++m] & i) == 0)
                    ;
                list[k++] = currentGroupList[m];
            }
            yield return list;
        }
    }
}
Then your client code looks like this:
static void Main(string[] args)
{
    int[] intList = { 1, 2, 3, 4 };
    foreach (IList<int> set in PowerSet<int>.powerset(intList))
    {
        foreach (int i in set)
            Console.Write("{0} ", i);
        Console.WriteLine();
    }
}
I'll even throw in a bit-twiddling algorithm with templated arguments for free. For added speed, you can wrap the powerset() inner loop in an unsafe block. It doesn't make much difference.
On my machine, this code is slightly slower than the OP's code until the sets are 16 or larger. However, all times to 16 elements are less than 0.15 seconds. At 23 elements, it runs in 64% of the time. The original algorithm does not run on my machine for 24 or more elements -- it runs out of memory.
This code takes 12 seconds to generate the power set for the numbers 1 to 24, omitting screen I/O time. That's 16 million-ish in 12 seconds, or about 1400K per second. For a billion (which is what you quoted earlier), that would be about 760 seconds. How long do you think this should take?
Does it have to be C, or is C++ an option too? If C++, you can just use its own list type from the STL (std::vector). Otherwise, you'll have to implement your own list - look up linked lists or dynamically sized arrays for pointers on how to do this.
I concur with the "optimize .NET first" opinion. It's the most painless. I imagine that if you wrote some unmanaged .NET code using C# pointers, it'd be identical to C execution, except for the VM overhead.
P Daddy:
You could change your Combination() code to this:
static long Combination(long n, long r)
{
    r = (r > n - r) ? (n - r) : r;
    if (r == 0)
        return 1;

    long result = 1;
    long k = 1;
    while (r-- > 0)
    {
        result *= n--;
        result /= k++;
    }
    return result;
}
This will reduce the multiplications and the chance of overflow to a minimum.
