Slow OpenGL Geometry Shader DrawArrays / Transform Feedback - c#

I am using OpenGL (via OpenTK) to perform spatial queries on lots of point cloud data on the GPU. Each frame of data is around 200k points. This works well for low numbers of queries (<10 queries @ 60 fps) but does not scale as more are performed per data frame (100 queries @ 6 fps).
I would have expected modern GPUs to be able to chew through 20 million points (200k * 100 queries) across 100 draw calls without breaking a sweat, especially since each glDrawArrays uses the same VBO.
A 'spatial query' consists of setting some uniforms and a glDrawArrays call. The geometry shader then chooses to emit or not emit a vertex based on the result of the query. I have tried with and without branching and it makes no difference. The VBO uses separate (non-interleaved) attributes: one is STATIC_DRAW and the other is DYNAMIC_DRAW (updated before each batched frame of spatial queries). Transform feedback then collects the data.
Profiling shows that glGetQueryObject is by far the slowest call (probably blocking: 5600 inclusive samples compared to 127 for glDrawArrays), but I'm not sure how to improve this. I tried making lots of little result buffers in GPU memory and binding a new transform feedback buffer for each query, but this had no effect - perhaps due to running on a single core? The other option would be to read the video memory from the previous query on another thread, but this throws an Access Violation and I'm unsure whether the gains would be significant.
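For what it's worth, the pattern I'm considering is to issue the whole batch of draws first and only harvest the query results afterwards (ideally a frame late), instead of reading each result straight after its draw. A rough, untested sketch in OpenTK (enum names differ slightly between OpenTK versions; SetQueryUniforms and the per-query feedback buffer binding are placeholders):

// Issue every spatial query for the batch before asking for any results,
// so the GPU has a full queue of work before the first readback.
int[] queries = new int[queryCount];
GL.GenQueries(queryCount, queries);

for (int i = 0; i < queryCount; i++)
{
    SetQueryUniforms(i);   // placeholder: per-query uniforms + feedback buffer binding
    GL.BeginQuery(QueryTarget.TransformFeedbackPrimitivesWritten, queries[i]);
    GL.BeginTransformFeedback(TransformFeedbackPrimitiveType.Points);
    GL.DrawArrays(PrimitiveType.Points, 0, pointCount);
    GL.EndTransformFeedback();
    GL.EndQuery(QueryTarget.TransformFeedbackPrimitivesWritten);
}

// Harvest afterwards (or next frame); poll availability rather than blocking immediately.
for (int i = 0; i < queryCount; i++)
{
    int available = 0;
    while (available == 0)
        GL.GetQueryObject(queries[i], GetQueryObjectParam.QueryResultAvailable, out available);

    GL.GetQueryObject(queries[i], GetQueryObjectParam.QueryResult, out int written);
    // 'written' is the number of primitives this query emitted into its feedback buffer
}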
Any thoughts on how to improve performance? Am I missing something obvious like a debug mode that needs switching off?

Related

Spatial Understanding limited by a small area

I am working with the HoloLens in Unity and trying to map a large area (15 x 15 x 25 meters). I am able to map the whole area using the SpatialMapping prefab, but I want to do some spatial processing on that mesh to smooth out the floors and walls. I have been trying to use SpatialUnderstanding for this, but there seems to be a hard limit on how big of an area you can scan with it, as detailed in this HoloLens forums thread.
Currently, I don't understand how the pipeline of data works from SpatialMapping to SpatialUnderstanding. Why can I not simply use the meshes generated from SpatialMapping in SpatialUnderstanding? Is there some better method of creating smooth surfaces?
This solution works best for pre-generated rooms. In other words, a general solution, one that could be expected to be used by end users, is not possible given the current limitations.
I will start with the last question: "Is there some better method of creating smooth surfaces?"
Yes: use a tripod on wheels to generate the initial scan. Given the limited resolution of the accelerometers and compasses in the hardware, reducing the variance in one linear axis (height) and one rotational axis (roll, which should not vary at all during a scan) will result in a much more accurate scan.
The other method to create smooth surfaces is to export the mesh to a 3D editing program and manually flatten the surfaces, then reimport the mesh into Unity3D.
"Why can I not simply use the meshes generated from SpatialMapping in SpatialUnderstanding?"
SpatialUnderstanding further divides the generated mesh into (8 cm, 8 cm, 8 cm) voxels and then calculates surfels based on each voxel. To keep performance and memory utilization in check, a hard limit of approximately (10 m, 10 m, 10 m) is imposed, implemented as (128, 128, 128) voxels.
Any attempt to use SpatialUnderstanding beyond its defined limits will produce spurious results due to overflow of the underlying data structures.
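To put numbers on that limit (a quick illustrative calculation, not HoloLens API code):

// 128 voxels per axis at 8 cm per voxel gives the maximum playspace extent.
const float voxelSize = 0.08f;
const int voxelsPerAxis = 128;
float maxExtent = voxelSize * voxelsPerAxis;   // about 10.24 m per axis

// The 15 x 15 x 25 m area in the question exceeds this on every axis,
// so it cannot be covered by a single SpatialUnderstanding playspace.
bool fits = 15f <= maxExtent && 15f <= maxExtent && 25f <= maxExtent;   // false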

Reading and Writing to the GPU, Tips And Tricks for improving speed (especially in this scenario)

Currently my application has a major bottleneck when it comes to GPU/CPU data sharing.
Basically I am selecting multiple items; each item becomes a buffer, then a 2D texture (all of the same size), and they all get blended together on the GPU. After that I need to know various things about the blend result, which sits on the GPU as a single-channel float texture:
Maximum & Minimum value in the texture
Average value
Sum Value
Effectively I ended up with the very slow roundabout of:
Put data on the GPU (× N)
Read data back from the GPU
Cycle over the data on the CPU looking for the values
Obviously a CPU profile shows the two major hot spots as the writes and the read. The textures are in the 100 x 100 range, not 1000 x 1000, but there are a lot of them.
There are three things I am currently considering:
Combine all the data and find the interesting values before putting anything on the GPU (though then it seems pointless putting it on the GPU at all, and some of the blends are complex)
When loading the data, put it all onto the GPU (as texture levels), trading the lag on item selection for a slower initial load
Calculate the "interesting data" on the GPU and just have the CPU read back those values (rough sketch after this list)
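For option 3, the cheapest version I can think of is letting the driver build mipmaps for the single-channel result and reading back only the 1x1 top level as an approximate average; min/max/sum would still need a small reduction shader. Untested sketch with OpenTK-style names, since I'm not sure of the exact SharpGL spelling (blendResultTexture, texWidth and texHeight are assumed):

// Approximate the average of the R32F blend result via its mipmap chain and
// read back only the 1x1 top level (4 bytes) instead of the whole texture.
GL.BindTexture(TextureTarget.Texture2D, blendResultTexture);
GL.GenerateMipmap(GenerateMipmapTarget.Texture2D);

int topLevel = (int)Math.Floor(Math.Log(Math.Max(texWidth, texHeight), 2));
float[] average = new float[1];
GL.GetTexImage(TextureTarget.Texture2D, topLevel, PixelFormat.Red, PixelType.Float, average);
// average[0] is (roughly, given box filtering) the mean of the blended texture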
On my machine, with the data I have worked with, throwing all the data onto the GPU would barely dent GPU memory. The highest I have seen so far is 9000 entries of 170 x 90; as it's single-channel float, by my maths that comes out at roughly 1/2 GB. That isn't a problem on my machine, but I could see it being one on the average laptop. Can I get a GPU to page from HDD? Is this even worth pursuing?
Sorry for asking such a broad question but I am looking for the most fruitful avenue to pursue and each avenue would be new ground to me. Profiling seems to highlight readback as the biggest problem at the moment. Could I improve this by changing FBO/Texture settings?
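Regarding the readback itself, one avenue I have not tried yet is routing glReadPixels through a pixel-pack buffer so the call returns immediately, then mapping the buffer later once the GPU has caught up. Another untested sketch (again OpenTK-style names; the PBO handling is simplified):

// Kick off an asynchronous readback into a pixel-pack buffer.
int pbo = GL.GenBuffer();
GL.BindBuffer(BufferTarget.PixelPackBuffer, pbo);
GL.BufferData(BufferTarget.PixelPackBuffer, (IntPtr)(texWidth * texHeight * sizeof(float)),
              IntPtr.Zero, BufferUsageHint.StreamRead);
GL.ReadPixels(0, 0, texWidth, texHeight, PixelFormat.Red, PixelType.Float, IntPtr.Zero);

// ... submit other work, then map the buffer when the values are actually needed ...
IntPtr data = GL.MapBuffer(BufferTarget.PixelPackBuffer, BufferAccess.ReadOnly);
// inspect/copy the floats at 'data' here
GL.UnmapBuffer(BufferTarget.PixelPackBuffer);
GL.BindBuffer(BufferTarget.PixelPackBuffer, 0);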
At the moment I am working in SharpGL and would prefer to stick to OpenGL 3.3. If, however, there is a route to a rapid improvement in performance via a particular technique that is out of reach because of either video memory or GL version, I might be able to make a case to raise the software's system requirements.

Unity C# Voxel finite water optimization

I got a (basic) voxel engine running and a water system that looks (and I assume basically works) like this: https://www.youtube.com/watch?v=Q_TdeGIOOts (not my game).
The water values are stored in a 3D array of floats, and every 0.05 s it calculates water flow by checking the voxel below and the adjacent voxels (y-1, x-1, x+1, z-1, z+1) and adding the value.
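For context, the update is shaped roughly like this; a simplified illustration rather than my exact rules, with made-up names and the horizontal spreading omitted:

// Every 0.05 s, each cell pushes water to the voxel below and then to its
// four horizontal neighbours, so the cost is O(width * height * depth) per chunk.
for (int x = 1; x < sizeX - 1; x++)
for (int y = 1; y < sizeY - 1; y++)
for (int z = 1; z < sizeZ - 1; z++)
{
    float w = water[x, y, z];
    if (w <= 0f) continue;

    float toBelow = Mathf.Min(w, 1f - water[x, y - 1, z]);   // fill the voxel below first
    next[x, y - 1, z] += toBelow;
    next[x, y, z] += w - toBelow;   // x-1, x+1, z-1, z+1 spreading omitted for brevity
}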
This system works fine (70+ fps) for small amounts of water, but when I start calculating water on 8+ chunks, it gets too much.
(I disabled all rendering and mesh creation to check whether that is the bottleneck; it isn't. It's purely the flow calculations.)
I am not a very experienced programmer, so I wouldn't know where to start optimizing, apart from making the calculations happen in a coroutine, as I have already done.
In this post: https://gamedev.stackexchange.com/questions/55414/how-to-define-areas-filled-with-water (near the bottom) Boreal suggests running it in a compute shader. Is this the way to go for me? And how would I go about such a thing?
Any help is much appreciated.
If you're really calculating a voxel-based simulation, the number of calculations grows cubically with the linear size of the volume, so you will quickly run out of processing power on larger volumes.
A compute shader is great for doing massively parallel calculations quickly, although it's a very different programming paradigm that takes some getting used to. A compute shader will look at the contents of a buffer (i.e., a 'texture' for us civilians) and do things to it very quickly -- in your case the buffer will probably be a buffer/texture whose pixel values represent water cells. If you want to do something really simple, like incrementing them up or down, the compute shader uses the parallel processing power of the GPU to do it really fast.
The hard part is that GPUs are optimized for parallel processing. This means that you can't write code like "texelA.value += texelB.value" - without extra work on your part, each fragment of the buffer is processed with zero knowledge of what happens in the other fragments. To reference other texels you need to read the texture again somehow - some techniques read one texture multiple times with offsets (this GL example does this to implement blurs); others do it by repeatedly processing a texture, putting the result into a temporary texture and then reprocessing that.
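To make that concrete for Unity, the C# dispatch side of such a ping-pong setup might look roughly like this; the flow rules themselves would live in an HLSL .compute kernel, and "FlowStep", "WaterIn", "WaterOut" and the thread-group size here are made-up placeholders:

// Two buffers are swapped every step so each pass reads the previous state
// and writes the next one, exactly the temporary-texture trick described above.
int count = sizeX * sizeY * sizeZ;
ComputeBuffer waterA = new ComputeBuffer(count, sizeof(float));
ComputeBuffer waterB = new ComputeBuffer(count, sizeof(float));
waterA.SetData(initialWater);                      // flattened copy of the 3D float array

int kernel = waterCompute.FindKernel("FlowStep");  // waterCompute is a ComputeShader asset
waterCompute.SetInts("GridSize", sizeX, sizeY, sizeZ);

// One simulation step (run every 0.05 s):
waterCompute.SetBuffer(kernel, "WaterIn", waterA);
waterCompute.SetBuffer(kernel, "WaterOut", waterB);
waterCompute.Dispatch(kernel, Mathf.CeilToInt(sizeX / 8f),
                              Mathf.CeilToInt(sizeY / 8f),
                              Mathf.CeilToInt(sizeZ / 8f));
var tmp = waterA; waterA = waterB; waterB = tmp;   // swap for the next step
// Only call waterA.GetData(...) when the CPU really needs the values; it stalls the GPU.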
At the 10,000-foot level: yes, a compute shader is a good tool for this kind of problem, since it involves tons of self-similar calculation. But it won't be easy to do off the bat. If you have not done conventional shader programming before, you may want to look at that first to get used to the way GPUs work. Even really basic tools (if-then-else or loops) have very different performance implications and uses in GPU programming, and it takes some time to get your head around the differences. As of this writing (1/10/13) it looks like Nvidia and Udacity are offering an intro compute-shader course, which might be a good way to get up to speed.
FWIW you also need pretty modern hardware for compute shaders, which may limit your audience.

Will Direct2D be better than Qt at rendering lines onto an off-screen buffer

I have a data visualization application (Parallel Coordinates). This involves drawing loads of lines on screen. The application is for huge datasets. The test data set involves ~2.5M lines (2,527,700 exactly). The screen is going to be cluttered, but it shows some pattern. The application has a facility to scale along X and Y. At typical zoom levels it draws around 300K lines. The application is written in Qt and the time taken to render is respectable. Typical numbers are:
Time taken (ms)   Lines drawn
496               1003226
603               1210032
112               344582
182               387960
178               361424
222               676470
171               475652
251               318709
5                 14160
16                27233
The following code segment is used to time and count the line segments drawn. Rendering happens to an off-screen image (a QImage with format Format_ARGB32_Premultiplied). The size of the QImage is at most 1366 x 768. The type of segments is QVector.
QTime m_timer;
m_timer.start();
painter.drawLines(segments);
qDebug() << "Time taken: " << m_timer.elapsed();
painter.end();
qDebug() << "Drew " << segments.size();
This QImage is cached to save future draws. I have never used DirectX. Will Direct2D rendering give any performance advantage over what I already have? Is there any way to improve on these numbers?
If Direct2D rendering can improve these numbers, what tech stack should I use? Will using C#/SharpDX be better? I ask because Qt can do DirectX through translation only (not sure what the cost is), and since the app is predominantly Windows-based, C# might ease the development process.
It might be optimal to create your own line-drawing function operating on the QImage memory buffer directly. There you can make some assumptions that Qt's painter probably cannot, and achieve better performance. Then blit the image to the screen antialiased (and possibly zoomed?) using a QGLWidget.
Good line drawing code: http://en.wikipedia.org/wiki/Bresenham%27s_line_algorithm#Simplification .
Though for a C++ memory-buffer implementation, it can be further optimized by replacing the x0 and y0 coordinates with a single pointer.
Most importantly, you can then get the line drawing inlined within your lines loop and have zero function calls while iterating over a million lines.
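To be concrete, the inner loop is just this; a sketch in C# since the question mentions possibly moving to C#/SharpDX (in C++ the same loop writes through QImage::bits(), which is where the single-pointer trick above applies). No clipping, no antialiasing, no pen logic:

// Bresenham's algorithm writing one 32-bit word per pixel into a raw buffer.
// 'stride' is the number of pixels per scanline.
static void DrawLine(uint[] pixels, int stride, int x0, int y0, int x1, int y1, uint argb)
{
    int dx = Math.Abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -Math.Abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;

    while (true)
    {
        pixels[y0 * stride + x0] = argb;
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}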
If you let the blitting part do antialiasing, you can just set a memory word per pixel without checking what kind of pen it is etc.
Also, with zooming done in hardware to any desired target resolution, you can freely optimize the QImage size for getting the pattern you want, separately from showing it on screen.
If you can have the lines mostly sorted by Y coordinate, that will help with caching, as setting pixels of sequential lines will be closer in memory in the QImage buffer.
However you do it, crank up compiler optimization (the -O3 flag for gcc; not sure about the MS compiler); the default is not the maximum, at least for gcc.
Another thing you can do is parallelize the drawing at the per-image level; that will work even with QPainter. Drawing on a QImage is allowed outside the main thread, and Qt makes it easy to send QImages as signal parameters from drawing threads to the GUI thread. What you are doing is heavily CPU- and memory-bound, so it is best to keep the drawing thread count equal to the real core count, not counting hyperthreading.
I don't know Qt, but as for Direct2D, you wouldn't get good performance drawing millions of antialiased lines on Windows 7 (without the latest IE10/SP update). On Windows 8, Direct2D has been slightly improved, but I'm not sure it can handle such a rate. If you want better performance, you would have to use Direct3D directly with an MSAA surface, possibly a screen-space post-effect like FXAA, or geometry-aware anti-aliasing techniques like GPAA/GBAA. Check "Filtering Approaches for Real-Time Anti-Aliasing" for a comprehensive list of state-of-the-art real-time anti-aliasing techniques.
Maybe I won't answer your question directly, but:
I guess that Qt uses standard system routines to perform graphical operations, which means GDI and/or GDI+ here. Neither of these is fast; both do all computations on the CPU. On the other hand, Direct2D (from DirectX 11 onward) attempts to accelerate drawing using GPU power where possible, so it may work faster.
I don't know Qt well, but I guess that you should be able to retrieve a handle to the window (i.e. control) you want to draw on somehow (even if that means using WinAPI; there's always a way). If so, you can bypass all the Qt mechanisms and draw directly on that window using DirectX with no additional overhead.
As for SharpDX: I didn't know that such a library exists, thanks for the info. They published a benchmark of its performance that you may want to check out: http://code4k.blogspot.com/2011/03/benchmarking-cnet-direct3d-11-apis-vs.html. Anyway, it seems to be a more or less thin wrapper around the DirectX headers, so it shouldn't add much overhead to your application (unless performance is really critical).

Why doesn't `Texture2D` expose its pixel data?

I can easily think of a number of situations where it would be useful to change a single pixel in a Texture2D, especially because of the performance hit and inconvenience you get when constantly doing GetData<>(); SetData<>(); every frame or drawing to a RenderTarget2D.
Is there any real reason not to expose setter methods for single pixels? If not, is there a way to modify a single pixel without using the methods above?
Texture data is almost always copied to video memory (VRAM) by the graphics driver when the texture is initialized, for performance reasons. This makes texture fetches by the shaders running on the GPU significantly faster; you would definitely not be happy if every texture cache miss had to fetch the missing data over the PCIe bus!
However, as you've noticed, this makes it difficult and/or slow for the CPU to read or modify the data. Not only is the PCIe bus relatively slow, but VRAM is generally not directly addressable by the CPU; data must usually be transferred using special low-level DMA commands. This is exactly why you see a performance hit when using XNA's GetData<>() and SetData<>(): it's not the function call overhead that's killing you, it's the fact that they have to copy data back and forth to VRAM behind your back.
If you want to modify data in VRAM, the low-level rendering API (e.g. OpenGL or Direct3D 11) gives you three options:
Temporarily "map" the pixel data before your changes (which involves copying it back to main memory) and "unmap" it when your edits are complete (to commit the changes back to VRAM). This is probably what GetData<>() and SetData<>() are doing internally.
Use a function like OpenGL's glTexSubImage2D(), which essentially skips the "map" step and copies the new pixel data directly back to VRAM, overwriting the previous contents.
Instruct the GPU to make the modifications on your behalf, by running a shader that writes to the texture as a render target.
XNA is built on top of Direct3D, so it has to work within these limitations as well. So, no raw pixel data for you!
(As an aside, everything above is also true for GPU buffer data.)
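If you really only touch a few pixels, one small mitigation within XNA itself: the overload of SetData<>() that takes a Rectangle uploads just that sub-region, so you avoid round-tripping the whole texture even though the copy to VRAM described above still happens (sketch; texture, x and y are assumed):

// Update a single pixel without re-uploading the entire texture (XNA 4.0).
Color[] onePixel = { Color.Red };
texture.SetData(0, new Rectangle(x, y, 1, 1), onePixel, 0, 1);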
