Why is z always zero in CUDA kernel - c#

I am using Cudafy to do some calculations on a NVIDIA GPU.
(Quadro K1100M capability 3.0, if it matters)
My question is: when I launch the kernel with
cudaGpu.Launch(new dim3(44, 8, num), new dim3(8, 8)).MyKernel...
why are the z indices from the GThread instance always zero when I compute the following in my kernel?
int z = thread.blockIdx.z * thread.blockDim.z + thread.threadIdx.z;
Furthermore, if I do something like
cudaGpu.Launch(new dim3(44, 8, num), new dim3(8, 8, num)).MyKernel...
z does give different indices as it should, but num can't be very large because of the restriction on the number of threads per block. Any suggestion on how to work around this?
Edit
Another way to phrase it: can I use thread.z in my kernel (for anything useful) when the block size is only 2D?

On all currently supported hardware, CUDA allows the use of both three-dimensional grids and three-dimensional blocks. On compute capability 1.x devices (which are no longer supported), grids were restricted to two dimensions.
However, CUDAfy currently uses a deprecated runtime API function to launch kernels, and silently uses only gridDim.x and gridDim.y, not taking gridDim.z into account:
_cuda.Launch(function, gridSize.x, gridSize.y);
As seen in the function DoLaunch() in CudaGPU.cs.
So while you can specify a three-dimensional grid in CUDAfy, the third dimension is ignored during the kernel launch. Thanks to Florent for pointing this out!
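One possible workaround (a sketch, not from the original answer; the 44 / 8 / num values are just the question's launch parameters) is to fold the z dimension into the grid's y dimension at launch time and recover z inside the kernel:

// Sketch: fold z into the grid's y dimension, since CUDAfy only passes
// gridSize.x and gridSize.y to the driver. Launch with a 2D grid instead:
//   cudaGpu.Launch(new dim3(44, 8 * num), new dim3(8, 8)).MyKernel(...);

[Cudafy]
public static void MyKernel(GThread thread /*, ... */)
{
    int x = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;

    int blocksPerZ = 8;                          // the original gridDim.y
    int z = thread.blockIdx.y / blocksPerZ;      // which z "slice" this block belongs to
    int yBlock = thread.blockIdx.y % blocksPerZ; // block index within that slice
    int y = yBlock * thread.blockDim.y + thread.threadIdx.y;

    // ... use x, y and z as before ...
}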

Related

Declaring a jagged array succeeds, but out of memory when declaring a multi-dimen array of same size

I get an out of memory exception when running this line of code:
double[,] _DataMatrix = new double[_total_traces, _samples_per_trace];
But this code completes successfully:
double[][] _DataMatrix = new double[_total_traces][];
for (int i = 0; i < _total_traces; i++)
{
    _DataMatrix[i] = new double[_samples_per_trace];
}
My first question is why is this happening?
As a followup question, my ultimate goal is to run Principal Component Analysis (PCA) on this data. It's a pretty large dataset. The number of "rows" in the matrix could be a couple million. The number of "columns" will be around 50. I found a PCA library in the Accord.net framework that seems popular. It takes a jagged array as input (which I can successfully create and populate with data), but I run out of memory when I pass it to PCA - I guess because it is passing by value and creating a copy of the data(?). My next thought was to just write my own method to do the PCA so I wouldn't have to copy the data, but I haven't got that far yet. I haven't really had to deal with memory management much before, so I'm open to tips.
Edit: This is not a duplicate of the topic linked below, because that link did not explain how the memory of the two was stored differently and why one would cause memory issues despite them both being the same size.
In 32 bits it is complex to have a contiguous range of addresses of more than a few hundred MB (see for example https://stackoverflow.com/a/30035977/613130), but it is easy to have scattered pieces of memory totalling a few hundred MB (or even 1 GB)...
The multidimensional array is a single slab of contiguous memory; the jagged array is a collection of many small arrays (so many small pieces of memory).
Note that in 64 bits it is much easier to create an array of the maximum size permitted by .NET (around 2 GB or even more... see https://stackoverflow.com/a/2338797/613130).
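If the rest of the code still wants matrix-style indexing, one option (a sketch; the class name is illustrative and assumes row-by-row access is acceptable) is to keep the data in jagged storage and wrap it in a small class with a two-dimensional indexer:

// Sketch: jagged storage behind a [row, col] indexer, so memory stays in
// many small arrays instead of one large contiguous block.
public sealed class JaggedMatrix
{
    private readonly double[][] _rows;

    public JaggedMatrix(int rows, int cols)
    {
        _rows = new double[rows][];
        for (int i = 0; i < rows; i++)
            _rows[i] = new double[cols];
    }

    public double this[int row, int col]
    {
        get { return _rows[row][col]; }
        set { _rows[row][col] = value; }
    }

    // Expose the underlying jagged array for libraries that accept double[][].
    public double[][] Rows { get { return _rows; } }
}

This never requests a single contiguous block of hundreds of MB, and the Rows property can still be passed to libraries that take double[][] (like the Accord.NET PCA mentioned in the question), though that will not by itself avoid any copying the library does internally.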

(C# / Unity) Fastest way to traverse a 2D array or fastest way to find an integer in an array of integers (Cellular Automata)

(Question is in C#, what I'm stating about Unity and Cellular Automata is to provide context)
I'm writing a Cellular Automata system and I need to update it with a new generation very quickly, every 0.1 seconds.
Each cell is displayed with a GameObject on the scene, which is cached.
The problem is, when I want to determine whether the current cell should survive or not, I need to search for an integer (the number of neighbors that cell has) in an array of integers, which is very slow.
Note that this happens 8000+ times per 0.1 second.
I've used an array and the Array.IndexOf() static method, which was very slow. Next I found out about HashSet; it's better, but still not fast enough, since I still see FPS drops caused by hiccups when calling HashSet.Contains() to do my search.
The array of integers is actually very small, about 5 integers at most (and mind you, I actually need two of these sets, one for the survive rules and one for the born rules).
So I was wondering if there is a faster way to do this search in C#.
Thanks.
EDIT:
I found out that Lists are faster for a small number of items, as in my case. I've tested it and it's better, but it still lags.
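One alternative worth sketching here (not from the original post): since a neighbor count can only be 0 to 8, the survive/born rules can be precomputed into small lookup tables indexed by the count, so each cell update becomes a single array read instead of a search. The rule values below are illustrative.

// Sketch: rule lookup tables indexed by neighbor count (0..8).
bool[] surviveTable = new bool[9];
bool[] bornTable = new bool[9];

int[] surviveRule = { 2, 3 };   // example values, stand-ins for the real sets
int[] bornRule = { 3 };

foreach (int n in surviveRule) surviveTable[n] = true;
foreach (int n in bornRule) bornTable[n] = true;

// Per cell, per generation:
// bool nextState = currentlyAlive ? surviveTable[neighborCount]
//                                 : bornTable[neighborCount];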

How to rewrite MATLAB pmtm in Mathematica or C#

How can I rewrite MATLAB pmtm function in Mathematica or C# (.NET 4.0)?
I am using pmtm in this way:
[p,f] = pmtm(data,tapers,n,fs);
Alternatively written without pmtm using spectrum.mtm and psd.
Hs = spectrum.mtm(tapers,'adapt');
powerspectrum = psd(Hs,data,'Fs',fs,'NFFT',n);
p = powerspectrum.data;
f = powerspectrum.Frequencies;
Where data is a column vector with 2048 elements, fs = 40, tapers = 8 and n = 2^nextpow2(size(data,1)) = 2048;
Thanks.
The pmtm (multitaper method) is a non-parametric method for computing a power spectrum, similar to the periodogram approach.
In this method a power spectrum is computed by windowing the data, computing a Fourier transform, taking the magnitude of the result and squaring it. The multitaper method averages a pre-determined number of periodograms, each computed with a different window. This works because the selected windows have two mathematical properties. First, the windows are orthogonal. This means that each of the periodograms is uncorrelated, so averaging multiple periodograms gives an estimate with a lower variance than using just one taper. Second, the windows have the best possible concentration in the frequency domain for a fixed signal length. This means that these windows give the best possible performance with respect to leakage.
Mathematica has a time series package that contains functions like PowerSpectralDensity.
If you have further problems, ask your question on https://mathematica.stackexchange.com/
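On the C# side there is no pmtm equivalent in the base framework. The sketch below only illustrates the periodogram-averaging step described above; it assumes the DPSS tapers have already been computed elsewhere (that is the hard part of pmtm) and uses Math.NET Numerics for the FFT:

using System.Numerics;
using MathNet.Numerics.IntegralTransforms;

// Sketch: average periodograms computed with different (precomputed) tapers.
static class MultitaperSketch
{
    // data: the signal; tapers: precomputed DPSS windows (tapers[k].Length == data.Length);
    // nfft: FFT length (>= data.Length); fs: sampling frequency.
    public static double[] Psd(double[] data, double[][] tapers, int nfft, double fs)
    {
        var psd = new double[nfft / 2 + 1];

        foreach (var taper in tapers)
        {
            var buffer = new Complex[nfft];
            for (int i = 0; i < data.Length; i++)
                buffer[i] = new Complex(data[i] * taper[i], 0.0); // window and zero-pad

            Fourier.Forward(buffer, FourierOptions.Matlab);       // FFT of the tapered data

            for (int i = 0; i < psd.Length; i++)
                psd[i] += buffer[i].Magnitude * buffer[i].Magnitude;
        }

        // Plain average over tapers; pmtm's adaptive weighting ('adapt') and
        // exact one-sided scaling are omitted in this sketch.
        for (int i = 0; i < psd.Length; i++)
            psd[i] /= tapers.Length * fs;

        return psd;
    }
}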

Extract a vector from a two dimensional array efficiently in C#

I have a very large two dimensional array and I need to compute vector operations on this array. NTerms and NDocs are both very large integers.
var myMat = new double[NTerms, NDocs];
I need to extract column vectors from this matrix. Currently, I'm using for loops.
int col = 100;
for (int i = 0; i < NTerms; i++)
{
    myVec[i] = myMat[i, col];
}
This operation is very slow. In Matlab I can extract the vector without the need for iteration, like so:
myVec = myMat[:,col];
Is there any way to do this in C#?
There are no such constructs in C# that will allow you to work with arrays as in Matlab. With the code you already have, you can speed up the vector creation using the Task Parallel Library that was introduced in .NET Framework 4.0.
Parallel.For(0, NTerms, i => myVec[i] = myMat[i, col]);
If your CPU has more than one core then you will get some improvement in performance; otherwise there will be no effect.
For more examples of how the Task Parallel Library can be used with matrices and arrays, you can refer to the MSDN article Matrix Decomposition.
But I doubt that C# is a good choice when it comes to some serious math calculations.
Some possible problems:
Could it be the way that elements are accessed for multi-dimensional arrays in C#? See this earlier article.
Another problem may be that you are accessing non-contiguous memory - so not much help from cache, and maybe you're even having to fetch from virtual memory (disk) if the array is very large.
What happens to your speed when you access a whole row at a time, instead of a column? If that's significantly faster, you can be 90% sure it's a contiguous-memory issue...
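To illustrate the contiguous-memory point: a double[,] is stored in row-major order, so a whole row can be copied without a loop, while a column cannot. A sketch, reusing the myMat and NDocs names from the question:

// Sketch: rows of a double[,] are contiguous, so a row can be block-copied.
int row = 100;
double[] rowVec = new double[NDocs];
Buffer.BlockCopy(myMat, row * NDocs * sizeof(double), rowVec, 0, NDocs * sizeof(double));

// Columns are strided in memory, so extracting one still needs a loop
// (or store the matrix transposed so the needed "columns" become contiguous rows).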

How come my class take so much space in memory?

I will have literally tens of millions of instances of some class MyClass and want to minimize its memory size. The question of how to measure how much space an object takes in memory was discussed in "Find out the size of a .net object".
I decided to follow Jon Skeet's suggestion, and this is my code:
// Edit: This line is "dangerous and foolish" :-)
// (However, commenting it does not change the result)
// [StructLayout(LayoutKind.Sequential, Pack = 1)]
public class MyClass
{
    public bool isit;
    public MyClass nextRight;
    public MyClass nextDown;
}

class Program
{
    static void Main(string[] args)
    {
        var a1 = new MyClass(); // to prevent JIT code mangling the result (Skeet)
        var before = GC.GetTotalMemory(true);
        MyClass[] arr = new MyClass[10000];
        for (int i = 0; i < 10000; i++)
            arr[i] = new MyClass();
        var after = GC.GetTotalMemory(true);
        var per = (after - before) / 10000.0;
        Console.WriteLine("Before: {0} After: {1} Per: {2}", before, after, per);
        Console.ReadLine();
    }
}
I run the program on 64-bit Windows, choose "Release", platform target "Any CPU", and enable "Optimize code" (the options only matter if I explicitly target x86). The result is, sadly, 48 bytes per instance.
My calculation would be 8 bytes per reference, plus 1 byte for the bool, plus some ~8-byte overhead. What is going on? Is this a conspiracy to keep RAM prices high and/or let non-Microsoft code bloat? Well, OK, I guess my real question is: what am I doing wrong, or how can I minimize the size of MyClass?
Edit: I apologize for being sloppy in my question; I edited a couple of identifier names. My concrete and immediate concern was to build a "2-dim linked list" as a sparse boolean matrix implementation, where I can easily enumerate the set values in a given row/column. [Of course that means I also have to store the x,y coordinates on the class, which makes my idea even less feasible.]
Approach the problem from the other end. Rather than asking yourself "how can I make this data structure smaller and still have tens of millions of them allocated?" ask yourself "how can I represent this data using a completely different data structure that is far more compact?"
It looks like you are building a doubly-linked list of bools, which, as you note, uses thirty to fifty times more memory than it needs to. Is there some reason why you're not simply using a BitArray to store your list of bools?
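For scale, a BitArray stores one bool per bit; a short sketch (the grid dimensions and indexing convention are illustrative, not from the answer):

// Sketch: one bit per cell instead of one MyClass instance per cell.
int width = 10000, height = 10000;
var cells = new System.Collections.BitArray(width * height); // ~12 MB for 100 million cells

int x = 3, y = 7;
cells[y * width + x] = true;                 // set cell (x, y)
bool isSet = cells[y * width + x];           // read cell (x, y)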
UPDATE:
in fact I was trying to implement a sparse boolean two-dimensional matrix
Well why didn't you say so in the first place?
When I want to make a sparse Boolean two-d matrix of enormous size, I build an immutable persistent boolean quadtree with a memoized factory. If the array is sparse, or even if it is dense but self-similar in some way, you can achieve enormous compressions. Square arrays of 2^64 x 2^64 Booleans are easily representable, even though obviously, as a real array, that would be more memory than exists in the world.
I have been toying with the idea of doing a series of blog articles on this technique; I will likely do so in late March. (UPDATE: I did not write that article in March 2012; I wrote it in August 2020. https://ericlippert.com/2020/08/17/life-part-32/)
Briefly, the idea is to make an abstract class Quad that has two subclasses: Single, and Multi. "Single" is a doubleton -- like a singleton, but with exactly two instances, called True and False. A Multi is a Quad that has four sub-quads, called NorthEast, SouthEast, SouthWest and NorthWest.
Each Quad has an integer "level"; the level of a Single is zero, and a multi of level n is required to have all of its children be Quads of level n-1.
The Multi factory is memoized; when you ask it to make a new Multi with four children, it consults a cache to see if it has made it before. If it has, it does not construct a new one; it hands out the old one. Since Quads are immutable, you do not have to worry about someone changing the Quad on you after it is in the cache.
Consider now how many memory words (a word is 4 or 8 bytes depending on architecture) an "all false" Multi of level n consumes. A level 1 "all false" multi consumes four words for the links to its children, a word for the level count (if necessary; you are not required to keep the level in the multi, though it helps for debugging) and a couple words for the sync block and so on. Let's call it eight words. (Plus the memory for the False Single quad, which we can assume is a constant two or three words, and thereby may be ignored.)
A level 2 "all false" multi consumes the same eight words, but each of its four children is the same level 1 multi. Therefore the total consumption of the level 2 "all false" multi is let's say 16 words.
The same for level 3, 4, ... and so on. The total memory consumption for a level 64 multi that is logically a 2^64 x 2^64 square array of Booleans is only 64 x 16 memory words!
Make sense? Hopefully that is enough of a sketch to get you going. If not, see my blog link above.
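A minimal C# sketch of the structure described above (the member names and memoization details are illustrative, not taken from the blog post):

using System;
using System.Collections.Generic;

// Sketch: immutable quadtree with a memoized Multi factory.
public abstract class Quad
{
    public abstract int Level { get; }
}

public sealed class Single : Quad
{
    public static readonly Single True = new Single();
    public static readonly Single False = new Single();
    private Single() { }
    public override int Level { get { return 0; } }
}

public sealed class Multi : Quad
{
    public readonly Quad NorthEast, SouthEast, SouthWest, NorthWest;
    private readonly int level;
    public override int Level { get { return level; } }

    private Multi(Quad ne, Quad se, Quad sw, Quad nw)
    {
        NorthEast = ne; SouthEast = se; SouthWest = sw; NorthWest = nw;
        level = ne.Level + 1;   // all four children are assumed to have the same level
    }

    // Memoized factory: identical (ne, se, sw, nw) requests return the same instance.
    private static readonly Dictionary<Tuple<Quad, Quad, Quad, Quad>, Multi> cache =
        new Dictionary<Tuple<Quad, Quad, Quad, Quad>, Multi>();

    public static Multi Make(Quad ne, Quad se, Quad sw, Quad nw)
    {
        var key = Tuple.Create(ne, se, sw, nw);
        Multi result;
        if (!cache.TryGetValue(key, out result))
        {
            result = new Multi(ne, se, sw, nw);
            cache[key] = result;
        }
        return result;
    }
}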
8 (object reference) + 8 (object reference) + 1 (bool) + 16 (header) + 8 (reference in array itself) = 41
Even if it's misaligned internally, each instance will be aligned on the heap, so we're looking at at least 48 bytes.
I can't for the life of me see why you'd want a linked list of bools, though. A list of them would take 48 times less space, and that's before you get to optimisations such as storing a bool per bit, which would make it 384 times smaller. And easier to manipulate.
If these hundreds of millions of instances of the class are mostly copies of the class with minor variations in class property values, then your system is a prime candidate to use what is called the Flyweight pattern. This pattern minimizes memory use by using the same instances over and over, and just changing the properties as needed...
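A rough sketch of the pattern (names are illustrative): a factory hands out one shared, immutable instance per distinct intrinsic state, while per-object (extrinsic) data such as coordinates or links lives outside the shared instances.

using System.Collections.Generic;

// Sketch: a flyweight factory that returns one shared, immutable instance
// per distinct intrinsic state.
public sealed class Flyweight
{
    public readonly int SharedValue;                 // intrinsic state (illustrative)

    private Flyweight(int value) { SharedValue = value; }

    private static readonly Dictionary<int, Flyweight> cache = new Dictionary<int, Flyweight>();

    public static Flyweight Get(int value)
    {
        Flyweight instance;
        if (!cache.TryGetValue(value, out instance))
        {
            instance = new Flyweight(value);
            cache[value] = instance;
        }
        return instance;
    }
}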