Uniform buffer size on Nvidia GPUs - C#

I use C# with OpenTK to access the OpenGL API. My project uses tessellation to render a heightmap. My tessellation control shader splits a square into a 64×64 grid of squares, and my tessellation evaluation shader adds vertical offsets to those points. The vertical offsets are stored in a uniform float array like this:
uniform float HeightmapBuffer[65 * 65];
Everything works fine when I run the project on my laptop with an AMD Radeon 8250 GPU. The problems start when I try to run it on Nvidia graphics cards. I tried an older GT 430 and a brand new GTX 1060, but the results are the same:
Tessellation evaluation info
----------------------------
0(13) : error C5041: cannot locate suitable resource to bind variable "HeightmapBuffer". Possibly large array.
As I researched this problem, I found the GL_MAX_UNIFORM_BLOCK_SIZE value, which returns ~500 MB on the AMD chip and 65.54 kB on both Nvidia chips. That is a little strange, since my array actually uses only 16.9 kB, so I am not even sure whether "BLOCK SIZE" limits the size of a single variable. Maybe it limits the size of all uniforms passed to one shader? Even so, I can't believe my program would use 65 kB.
Note that I also tried the 'common' way of using a texture, but there were problems with interpolation: when placing two adjacent heightmaps next to each other, the borders didn't match. With a uniform float array, on the other hand, things work perfectly.
So what is the actual meaning of GL_MAX_UNIFORM_BLOCK_SIZE? Why is this value on Nvidia GPUs so low? Is there any other way to pass a large array to my shader?

As I researched this problem, I found GL_MAX_UNIFORM_BLOCK_SIZE variable which returns ~500MB on the AMD and 65.54 kB on both Nvidia chips.
GL_MAX_UNIFORM_BLOCK_SIZE is the wrong limit. That applies only to Uniform Buffer Objects.
You just declare an array
uniform float HeightmapBuffer[65 * 65];
outside of a uniform block. Since you use this in a tessellation evaluation shader, the relevant limit is MAX_TESS_EVALUATION_UNIFORM_COMPONENTS (there is a separate such limit for each programmable stage). This limit counts float components, so a vec4 consumes 4 components and a float just one.
In your particular case, the latest GL spec at the time of this writing, [GL 4.6 core profile](https://www.khronos.org/registry/OpenGL/specs/gl/glspec46.core.pdf), only guarantees a minimum value of 1024 for that limit (= 4 KiB), and you are way beyond it.
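To put numbers on that (a quick arithmetic sketch; 1024 is the spec minimum quoted above, and actual drivers may expose more):

```python
# Plain (non-block) uniforms are limited per stage by a component count,
# not by GL_MAX_UNIFORM_BLOCK_SIZE. Each plain float costs one component;
# a vec4 would cost four.
heightmap_components = 65 * 65   # 4225 components for the height array
guaranteed_minimum = 1024        # GL 4.6 minimum for MAX_TESS_EVALUATION_UNIFORM_COMPONENTS
print(heightmap_components)      # more than four times the guaranteed budget
```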
It is actually a very bad idea to use plain uniforms for that much data. You should consider using UBOs, Texture Buffer Objects, Shader Storage Buffer Objects, or even plain textures to store your array. UBOs would probably be the most natural choice in your scenario.
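One caveat worth knowing before moving the array into a UBO: under the default std140 layout, each element of a float array is padded to 16 bytes, so a naive float[65*65] block member can itself blow past a 64 KiB block limit. A back-of-the-envelope sketch (the 64 KiB figure is the Nvidia value reported in the question; packing four heights per vec4 is one common workaround):

```python
# std140 rounds the array stride of a float array up to 16 bytes,
# so float[4225] occupies far more than the raw 16.9 kB of data.
ELEMENTS = 65 * 65
naive_std140_bytes = ELEMENTS * 16     # 67600 -- exceeds a 65536-byte block
# Packing four heights per vec4 removes almost all of the padding:
packed_vec4s = (ELEMENTS + 3) // 4     # 1057 vec4s
packed_bytes = packed_vec4s * 16       # 16912 -- comfortably within the block limit
print(naive_std140_bytes, packed_bytes)
```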

Related

Improve performance of per-pixel image multiplication with mask and thresholding

I am looking for suggestions to make this algorithm run faster. Using C# and EmguCV (a managed wrapper over OpenCV), I am processing images in real time, adjusting the gain of pixels that are not fully saturated (basically a flat-field correction with a threshold). The code works correctly, but CPU usage is relatively high given the frame rate and image size.
I have dabbled a bit with UMat to perform the same operations, but performance was worse. I also tried a version with integer math (i.e., integer multiplication and division instead of a floating-point multiply), but that was slower than staying in floating point.
I have profiled the code and optimized it some by doing things like reusing my Mat objects and pointing Mats to existing blocks of memory instead of allocating new memory. Suggestions using OpenCV would also be welcome, and I will find/use the EmguCV equivalents.
var ushortImage = new Mat(_frameH, _frameW, DepthType.Cv16U, 1, (IntPtr)pixelsPtr, _frameW * 2);
//get mask of all the pixels that are at max-value
CvInvoke.InRange(ushortImage, _lowerRange, _upperRange, _byteMask);
//convert input to float for multiplication
ushortImage.ConvertTo(_floatFrame, DepthType.Cv32F);
//multiply each pixel by its gain value
CvInvoke.Multiply(_floatFrame, gainMat, _floatFrame);
//threshold to allowable range
CvInvoke.Threshold(_floatFrame, _floatFrame, _frameMaxPixelValue, _frameMaxPixelValue, ThresholdType.Trunc);
//convert back to 16-bit
_floatFrame.ConvertTo(ushortImage, DepthType.Cv16U);
//restore the max-intensity pixels
ushortImage.SetTo(new MCvScalar(_frameMaxPixelValue), _byteMask);
Here is the profiling info from dotTrace. Nothing really stands out as the bottleneck, so I'm hoping there's a way to combine some operations into a single function call. Even a 10% improvement would help, because I'm trying to roughly double the throughput of the application.
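For reference, the pipeline above boils down to fairly simple per-pixel math. This is a NumPy sketch of the same six steps (function and parameter names are made up, and it is a compact restatement rather than a performance suggestion):

```python
import numpy as np

def flat_field(frame_u16, gain, max_val, lower, upper):
    # InRange: mask the pixels that are already at max intensity
    saturated = (frame_u16 >= lower) & (frame_u16 <= upper)
    # ConvertTo + Multiply: apply the per-pixel gain in float
    out = frame_u16.astype(np.float32) * gain
    # Threshold (Trunc): clamp to the allowable range
    np.minimum(out, max_val, out=out)
    # ConvertTo: back to 16-bit
    out = out.astype(np.uint16)
    # SetTo: restore the max-intensity pixels under the mask
    out[saturated] = max_val
    return out
```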

Sampling an arbitrary point within a DFT?

What I'm trying to do: I want to compress a 2D grey-scale map (2D array of float values between 0 and 1) into a DFT. I then want to be able to sample the value of points in continuous coordinates (i.e. arbitrary points in between the data points in the original 2D map).
What I've tried: So far I've looked at Exocortex and some similar libraries, but they seem to be missing functions for sampling a single point or performing lossy compression. Though the math is a bit above my level, I might be able to derive methods to do these things. Ideally someone can point me to a C# library that already has this functionality. I'm also concerned that libraries using the row-column FFT algorithm don't produce sinusoid functions that can easily be sampled this way, since they unwind the 2D array into a 1D array.
More detail on what I'm trying to do: The intended application for all this is an experiment in efficiently pre-computing, storing, and querying line-of-sight information. This is similar to the way spherical harmonic light probes are used to approximate lighting on dynamic objects. A grid of visibility probes stores compressed visibility data using a small number of float values each. From this grid, an observer position can calculate an interpolated probe, then use that probe to sample the estimated visibility of nearby positions. The results don't have to be perfectly accurate; this is intended as a first pass that can cheaply identify objects that are almost certainly visible or obscured, and then maybe perform more expensive ray-casting on the few on-the-fence objects.
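For what it's worth, "sampling at a continuous point" just means evaluating the inverse-DFT sum directly at non-integer coordinates, and the lossy compression amounts to zeroing all but the largest-magnitude coefficients before doing so. A NumPy sketch of the sampling half (not any particular C# library's API):

```python
import numpy as np

def sample_dft(F, x, y):
    """Evaluate the inverse 2D DFT of coefficient grid F at continuous (x, y)."""
    M, N = F.shape
    u = np.arange(M).reshape(-1, 1)   # row frequencies
    v = np.arange(N).reshape(1, -1)   # column frequencies
    basis = np.exp(2j * np.pi * (u * x / M + v * y / N))
    return (F * basis).sum().real / (M * N)
```

At integer coordinates this reproduces the original grid values exactly; between them it gives the trigonometric interpolant, which can ring near sharp edges.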

3D Buffers in HLSL?

I want to send a series of integers to HLSL in the form of a 3D array using Unity. I've been trying to do this for a couple of days now, without any success. I tried to nest the buffers (StructuredBuffer<StructuredBuffer<StructuredBuffer<int>>>), but it simply won't work. And I need to make this thing resizable, so I can't use fixed-size arrays in structs. What should I do?
EDIT: To clarify what I am trying to do here: this is a medical program. When a scan is made of your body, some files are generated, called DICOM files (.dcm), and those files are given to a doctor. The doctor opens the program, selects all of the DICOM files and loads them. Each DICOM file contains an image, but not a normal everyday image: these images are grayscale, and each pixel has a value that ranges from -1000 to a couple of thousand, so each pixel is saved as 2 bytes (an Int16). I need to generate a 3D model of the scanned body, so I'm using the Marching Cubes algorithm (have a look at Polygonising a Scalar Field).

The problem is that I used to loop over each pixel of about 360 images of 512*512 pixels, which took too much time; on the CPU I read the pixel data from each file as I needed it. Now I'm trying to make this process happen at runtime, so I need to send all of the pixel data to the GPU before processing it. That's my problem: I would need the GPU to read data from disk, and since that isn't possible, I need to send 360*512*512*4 bytes of data to the GPU in the form of a 3D array of ints. I'm also planning to keep the data there to avoid re-transferring that huge amount of memory. What should I do? Refer to this link to know more about what I'm doing.
From what I've understood, I would suggest trying the following:
Flatten your data (nested buffers are not what you want on your GPU)
Split your data across multiple ComputeBuffers if necessary (when I played around with them on an Nvidia Titan X, I could store approximately 1 GB of data per buffer; I was rendering a 3D point cloud with about 1.5 GB of data, so the ~360 MB you mentioned should not be a problem)
If you need multiple buffers: let them overlap as needed for your marching cubes algorithm
Do all of your calculations in a ComputeShader (I think this requires DX11; if you have multiple buffers, run it multiple times and accumulate your results), then use the results in a standard shader which you call from the OnPostRender function (use Graphics.DrawProcedural inside it to just draw points, or build a mesh on the GPU)
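The flattening in the first point is just an index calculation. Here is a sketch using the 360 × 512 × 512 volume from the question (the helper name is made up; the same arithmetic addresses a flat StructuredBuffer<int> in HLSL):

```python
D, H, W = 360, 512, 512   # slices, rows, columns

def flat_index(z, y, x):
    # Row-major flattening: consecutive x values are adjacent in memory.
    return (z * H + y) * W + x

print(flat_index(0, 0, 0), flat_index(359, 511, 511))  # 0 and D*H*W - 1
```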
Edit (Might be interesting to you)
If you want to append data to a gpu buffer (because you don't know the exact size or you can't write it to the gpu at once), you can use AppendBuffers and a ComputeShader.
C# Script Fragments:
struct DataStruct
{
    ...
}

DataStruct[] yourData = loadStuff();

ComputeBuffer tmpBuffer = new ComputeBuffer(512, Marshal.SizeOf(typeof(DataStruct)));
ComputeBuffer gpuData = new ComputeBuffer(MAX_SIZE, Marshal.SizeOf(typeof(DataStruct)), ComputeBufferType.Append);

for (int i = 0; i < yourData.Length / 512; i++) {
    // write the next batch of 512 elements to the temporary buffer on the gpu
    tmpBuffer.SetData(yourData.Skip(i * 512).Take(512).ToArray()); // Linq to select the data subset
    // set up and run the compute shader that appends this batch to the "gpuData" buffer
    AppendComputeShader.SetBuffer(0, "inBuffer", tmpBuffer);
    AppendComputeShader.SetBuffer(0, "appendBuffer", gpuData);
    AppendComputeShader.Dispatch(0, 512 / 8, 1, 1); // 8 = gpu work group size -> use 512/8 work groups
}
ComputeShader:
#pragma kernel append

struct DataStruct // replicate the C# struct in the shader
{
    ...
};

StructuredBuffer<DataStruct> inBuffer;
AppendStructuredBuffer<DataStruct> appendBuffer;

[numthreads(8,1,1)]
void append(uint id : SV_DispatchThreadID)
{
    appendBuffer.Append(inBuffer[id]);
}
Note:
AppendComputeShader has to be assigned via the Inspector
512 is an arbitrary batch size; there is an upper limit on how much data you can append to a gpu buffer at once, but I think that depends on the hardware (for me it seemed to be 65536 * 4 bytes)
you have to provide a maximum size for gpu buffers (on the Titan X it seems to be ~1GB)
In Unity we currently have MaterialPropertyBlock, which allows SetMatrixArray and SetVectorArray, and to make this even sweeter, we can set these globally using the static Shader helpers SetGlobalVectorArray and SetGlobalMatrixArray. I believe these will help you out.
In case you prefer the old way, please look at this quite nice article showing how to pass arrays of vectors.

Offloading calculations to the GPU

I'm performing a large number of calculations. Each calculation is independent of every other; in other words, the task could be parallelized, and I'd like to offload the job to the GPU.
Specifically, I'm creating light/shadow maps for an OpenGL application, and the calculations are a bunch of Vector math, dot products, square roots, etc.
What are my options here? Does OpenGL natively support anything like this, or should I be looking for an external library/module?
Compute shaders are the generic equivalent of CUDA, which is an enhanced compute API specific to Nvidia. Note that you don't need to use either: you can do calculations using a vertex -> geometry stream, or render to a pixel shader. As long as you can represent the results as a collection of values (a vertex buffer or texture), you can use the rendering pipeline to do your maths.
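The "collection of independent values" framing can be seen in miniature below: a NumPy sketch (illustrative data, not any OpenGL API) where every row is an independent distance/attenuation calculation, which is exactly the shape of work a compute or pixel shader spreads across threads:

```python
import numpy as np

# One light, many surface points: each row is computed independently.
points = np.array([[0.0, 0.0, 0.0],
                   [3.0, 4.0, 0.0]])
light = np.array([0.0, 0.0, 0.0])
dist = np.sqrt(((points - light) ** 2).sum(axis=1))  # per-point distance to the light
atten = 1.0 / (1.0 + dist)                           # a simple attenuation term
```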

OpenGL draw every n vertex

I am working on writing an application that contains line plots of large datasets.
My current strategy is to load up my data for each channel into 1D vertex buffers.
I then use a vertex shader when drawing to assemble my buffers into vertices (so I can reuse one of my buffers for multiple sets of data)
This is working pretty well, and I can draw a few hundred million data points without slowing down too much.
To stretch things a bit further, I would like to reduce the number of points that actually get drawn through simple decimation (i.e. draw every Nth point), as there is not much point plotting 1000 points that are all represented by a single pixel.
One way I can think of doing this is to use a geometry shader and only emit every Nth point, but I am not sure if this is the best plan of attack.
Would this be the recommended way of doing this?
You can do this much more simply by adjusting the stride of all vertex attributes to N times the normal one.
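The stride trick amounts to simple decimation of the buffer. A NumPy sketch of the effect (in GL you would instead pass N times the vertex size as the stride argument of glVertexAttribPointer, without copying any data):

```python
import numpy as np

vertices = np.arange(20, dtype=np.float32)  # stand-in for a vertex buffer
N = 4
drawn = vertices[::N]  # what an N-times stride makes the GPU fetch
print(drawn.tolist())  # [0.0, 4.0, 8.0, 12.0, 16.0]
```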
