I'm working on a "falling sand" style of game.
I've tried many ways of drawing the sand to the screen; however, each one produces some problem in one form or another.
Here is the list of approaches I've worked through:
1. Drawing each pixel individually, one at a time, from a pixel-sized texture. Problem: it slowed down after about 100,000 pixels were changing per update.
2. Drawing each pixel to one big Texture2D, drawing the Texture2D, then clearing the data. Problems: using texture.SetPixel() is very slow, and even with disposing of the old texture it caused a small memory leak (about 30 KB per second, which added up quickly), even after calling Dispose on the object. I simply could not figure out how to stop it. Overall, however, this has been the best method so far. If there is a way to stop that leak, I'd like to hear it.
3. Using LockBits from Bitmap. This worked wonderfully from the bitmap's perspective, but unfortunately I still had to convert the bitmap back to a Texture2D, which caused the frame rate to drop to less than one. So this has the potential to work very well, if I can find a way to draw the bitmap in XNA without converting it (or something).
4. Setting each pixel into a Texture2D with SetPixel, replacing the 'old' position of each pixel with a transparent pixel, then setting the new position with the proper color. This doubled the number of pixel sets necessary to finish the job, and was much, much slower than approach 2.
So, my question is: any better ideas? Or ideas on how to fix approaches 2 or 3?
My immediate thought is that you are stalling the GPU pipeline. The GPU can have a pipeline that lags several frames behind the commands that you are issuing.
So if you issue a command to set data on a texture, and the GPU is currently using that texture to render an old frame, it must finish all of its rendering before it can accept the new texture data. So it waits, killing your performance.
The workaround for this might be to use several textures in a double- (or even triple- or quad-) buffer arrangement. Don't attempt to write to a texture that you have just used for rendering.
Also - you can write to textures from a thread other than your rendering thread. This might come in handy, particularly for clearing textures.
As you seem to have discovered, it's actually quicker to SetData in large chunks rather than issue many small SetData calls. The ideal size for a "chunk" differs between GPUs, but it is a fair bit bigger than a single pixel.
Also, creating a texture is much slower than reusing one, in raw performance terms (if you ignore the pipeline effect I just described); so reuse that texture.
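Putting those points together, here's a minimal sketch of what that arrangement might look like in an XNA 4.0 Game class. The simWidth, simHeight and spriteBatch fields, and the idea that your simulation fills sandPixels each frame, are assumptions about your setup:

```csharp
// Minimal sketch, assuming XNA 4.0. simWidth/simHeight and spriteBatch are
// hypothetical fields; sandPixels is filled by your simulation each frame.
const int BufferCount = 3;        // triple-buffer so we never write to a texture in flight
Texture2D[] sandTextures;         // created once, reused forever
Color[] sandPixels;               // one big CPU-side array, reused every frame
int writeIndex;

protected override void LoadContent()
{
    spriteBatch = new SpriteBatch(GraphicsDevice);
    sandTextures = new Texture2D[BufferCount];
    for (int i = 0; i < BufferCount; i++)
        sandTextures[i] = new Texture2D(GraphicsDevice, simWidth, simHeight);
    sandPixels = new Color[simWidth * simHeight];
}

protected override void Draw(GameTime gameTime)
{
    GraphicsDevice.Clear(Color.Black);

    // ... fill sandPixels from the simulation here (pure CPU work) ...

    writeIndex = (writeIndex + 1) % BufferCount;    // rotate to a texture not drawn last frame
    sandTextures[writeIndex].SetData(sandPixels);   // one big SetData instead of per-pixel calls

    spriteBatch.Begin(SpriteSortMode.Deferred, BlendState.Opaque,
                      SamplerState.PointClamp, null, null);
    spriteBatch.Draw(sandTextures[writeIndex], Vector2.Zero, Color.White);
    spriteBatch.End();

    base.Draw(gameTime);
}
```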
It's worth mentioning that a "pixel sprite" requires sending maybe 30 times as much data per pixel to the GPU as a texture does.
See also this answer, which has a few more details and some in-depth links if you want to go deeper.
Related
I need to render an object multiple times a frame using different textures. I was wondering about the most performant way to do so. My first approach was to have one Material and in OnRenderImage() call SetTexture() on it for the given number of textures I have. Now I'm wondering if it would be a noticeable improvement if I set up one Material per Texture in Start() and change between Materials in OnRenderImage(). Then I wouldn't need the SetTexture() call. But I can't find what SetTexture() actually does. Does it just set a flag or does it copy or upload the texture somewhere?
From working with low-end devices extensively, I can say performance comes from batching. It's hard to pinpoint what would improve performance in your context without a clear understanding of the scope:
How many objects
How many images
Target platform
Are the images packaged or external at runtime
...
But as a general rule you want to use as few materials and individual textures as possible. If pre-processing is an option, I would recommend creating spritesheets with as many images as possible packed into a single image. Then, using UV offsetting, you can show multiple images on multiple objects in one draw call.
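As a rough illustration of the UV-offset idea (the grid size, cell index and component name are hypothetical), selecting one cell of a spritesheet looks something like this in Unity:

```csharp
using UnityEngine;

// Rough sketch: pick one cell out of a spritesheet by scaling and offsetting UVs.
// The grid dimensions and cell index are assumptions about how your sheet is laid out.
public class SpriteSheetCell : MonoBehaviour
{
    public int columns = 4;
    public int rows = 4;
    public int cellIndex = 0;

    void Start()
    {
        var mat = GetComponent<Renderer>().material;   // note: this instantiates the material

        var cellSize = new Vector2(1f / columns, 1f / rows);
        int x = cellIndex % columns;
        int y = cellIndex / columns;

        mat.mainTextureScale = cellSize;
        mat.mainTextureOffset = new Vector2(x * cellSize.x, y * cellSize.y);
    }
}
```

Bear in mind that accessing .material creates a per-object material instance, which works against batching; to genuinely stay at one draw call you would typically bake the offsets into the mesh or sprite UVs (or use a sprite atlas) rather than set per-object material offsets.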
I have used a solution called TexturePacker extensively; it supports Unity. It's not cheap (there's an app to buy plus a plugin for Unity), but it saves time, and draw calls, in the end.
Think packing hundreds of images into a few 4K textures, getting down to 3 or 4 draw calls versus hundreds before.
That might not be a solution in your case, but the concept is still valid.
Also, Unity prefabs will not save draw calls, but they do reduce memory usage.
hth.
J.
In my 2D game I have a large map, and I scroll around it (like in Age of Empires).
On draw I draw all the elements (which are textures/Images). Some of them are on the screen, but most of them are not (because you see about 10% of the map).
Does XNA know not to draw them by checking that the destination rectangle won't fall on the screen?
Or should I manually check it and avoid drawing them at all?
A sometimes overlooked aspect of CPU performance when using SpriteBatch is how you go about batching your sprites, and drawing from individual DrawableGameComponents doesn't lend itself easily to batching sprites efficiently.
Basically, the GPU drawing the sprites is not slow, and the CPU organizing all the sprites into a batch is not slow. The slow part is when the CPU has to communicate to the GPU what it needs to draw (sending it the batch info). This CPU-to-GPU communication does not happen when you call spriteBatch.Draw(); it happens when you call spriteBatch.End(). So the low-hanging fruit for efficiency is calling spriteBatch.End() less often (which of course means calling Begin() less often too). Also, use SpriteSortMode.Immediate very sparingly, because it causes the CPU to send each sprite's info to the GPU immediately (slow).
So if you call Begin() and End() in each game component class and have many components, you are costing yourself a lot of time unnecessarily, and you will probably save more time by coming up with a better batching scheme than by worrying about offscreen sprites.
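As a sketch of that idea (the sprites collection is hypothetical; the point is simply that every component shares one SpriteBatch and there is a single Begin/End per frame):

```csharp
// One shared SpriteBatch, one Begin/End pair per frame.
protected override void Draw(GameTime gameTime)
{
    GraphicsDevice.Clear(Color.CornflowerBlue);

    spriteBatch.Begin(SpriteSortMode.Deferred, BlendState.AlphaBlend);

    foreach (var sprite in sprites)           // hypothetical list of drawable things
        sprite.Draw(spriteBatch, gameTime);   // components only call spriteBatch.Draw(...) here

    spriteBatch.End();                        // the one CPU-to-GPU submission for the whole batch

    base.Draw(gameTime);
}
```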
Aside: The GPU automatically ignores offscreen sprites from its pixel shader anyway. So culling offscreen sprites on the CPU won't save GPU time.
Reference here.
You must manually account for this, and it will largely affect the performance of the game as it grows; this is otherwise known as culling. Culling is not done just because drawing things off screen reduces performance; it is because calling Draw that many extra times is slow. Anything out of the viewport that you don't need to update should be excluded too. You can see more about how to do this, and how SpriteBatch handles it, here.
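For the culling itself, a simple rectangle intersection test before each Draw call is usually enough. A sketch (camera and tiles are hypothetical, and this would sit between your Begin() and End() calls):

```csharp
// Only issue Draw calls for things that overlap the visible area.
Rectangle viewRect = new Rectangle((int)camera.X, (int)camera.Y,
                                   GraphicsDevice.Viewport.Width,
                                   GraphicsDevice.Viewport.Height);

foreach (var tile in tiles)
{
    if (viewRect.Intersects(tile.Bounds))     // skip fully offscreen sprites
    {
        // Shift from world space into screen space before drawing.
        var dest = new Rectangle(tile.Bounds.X - (int)camera.X,
                                 tile.Bounds.Y - (int)camera.Y,
                                 tile.Bounds.Width, tile.Bounds.Height);
        spriteBatch.Draw(tile.Texture, dest, Color.White);
    }
}
```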
I can easily think of a number of situations where it would be useful to change a single pixel in a Texture2D, especially because of the performance hit and inconvenience you get when constantly doing GetData<>(); SetData<>(); every frame or drawing to a RenderTarget2D.
Is there any real reason not to expose setter methods for single pixels? If not, is there a way to modify a single pixel without using the methods above?
Texture data is almost always copied to video memory (VRAM) by the graphics driver when the texture is initialized, for performance reasons. This makes texture fetches by the shaders running on the GPU significantly faster; you would definitely not be happy if every texture cache miss had to fetch the missing data over the PCIe bus!
However, as you've noticed, this makes it difficult and/or slow for the CPU to read or modify the data. Not only is the PCIe bus relatively slow, but VRAM is generally not directly addressable by the CPU; data must usually be transferred using special low-level DMA commands. This is exactly why you see a performance hit when using XNA's GetData<>() and SetData<>(): it's not the function call overhead that's killing you, it's the fact that they have to copy data back and forth to VRAM behind your back.
If you want to modify data in VRAM, the low-level rendering API (e.g. OpenGL or Direct3D 11) gives you three options:
Temporarily "map" the pixel data before your changes (which involves copying it back to main memory) and "unmap" it when your edits are complete (to commit the changes back to VRAM). This is probably what GetData<>() and SetData<>() are doing internally.
Use a function like OpenGL's glTexSubImage2D(), which essentially skips the "map" step and copies the new pixel data directly back to VRAM, overwriting the previous contents.
Instruct the GPU to make the modifications on your behalf, by running a shader that writes to the texture as a render target.
XNA is built on top of Direct3D, so it has to work within these limitations as well. So, no raw pixel data for you!
(As an aside, everything above is also true for GPU buffer data.)
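For what it's worth, the closest thing XNA gives you to option 2 is the SetData overload that takes a Rectangle, which re-uploads only the region you touched. A hedged sketch (texture, x and y are yours):

```csharp
// Update a single pixel's worth of data rather than the whole texture.
// This is still a CPU-to-GPU transfer, so batch as many changes as possible into
// one call, and avoid doing it to a texture the device is currently drawing from.
Color[] onePixel = { Color.Red };
Rectangle region = new Rectangle(x, y, 1, 1);
texture.SetData(0, region, onePixel, 0, 1);
```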
I've taken on quite a daunting challenge for myself. In my XNA game, I want to implement Blargg's NTSC filter. This is a C library that transforms a bitmap to make it look like it was output on a CRT TV with the NTSC standard. It's quite accurate, really.
The first thing I tried, a while back, was to just use the C library itself by calling it as a DLL. Here I had two problems: 1) I couldn't get some of the data to copy correctly, so the image was messed up; and, more importantly, 2) it was extremely slow. It required getting the XNA Texture2D bitmap data, passing it through the filter, and then setting the data back to the texture. The frame rate was ruined, so I couldn't go down this route.
Now I'm trying to translate the filter into a pixel shader. The problem here (if you're adventurous to look at the code - I'm using the SNES one because it's simplest) is that it handles very large arrays, and relies on interesting pointer operations. I've done a lot of work rewriting the algorithm to work independently per pixel, as a pixel shader will require. But I don't know if this will ever work. I've come to you to see if finishing this is even possible.
There's a precalculated array involved, containing 1,048,576 integers. Is this alone beyond any limits for the pixel shader? It only needs to be set once, not once per frame.
Even if that's ok, I know that HLSL cannot index arrays by a variable. It has to unroll it into a million if statements to get the correct array element. Will this kill the performance and make it a fruitless endeavor again? There are multiple array accesses per pixel.
Is there any chance that my original plan to use the library as is could work? I just need it to be fast.
I've never written a shader before. Is there anything else I should be aware of?
Edit: Addendum to #2. I just read somewhere that not only can HLSL not access arrays by variable, but even to unroll it, the index has to be calculable at compile time. Is this true, or does the "unrolling" solve this? If it's true, I think I'm screwed. Any way around that? My algorithm is basically a glorified version of "the input pixel is this color, so look up my output pixel values in this giant array."
From my limited understanding of shader languages, your problem can be solved by using a texture instead of an array.
Pregenerate it on the CPU and then save it as a texture: 1024x1024 in your case.
Use standard texture access functions as if the texture were the array, possibly with nearest-neighbour (point) sampling so that individual values are never blended together.
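A rough CPU-side sketch of that, assuming XNA: lookupTable is your precomputed array of 1,048,576 ints, and the effect parameter name and sampler register are made up for illustration:

```csharp
// Pack the precomputed int[] into a 1024x1024 texture once, so the shader can do a
// texture fetch instead of a variable array index.
Texture2D lut = new Texture2D(GraphicsDevice, 1024, 1024, false, SurfaceFormat.Color);

Color[] packed = new Color[lookupTable.Length];
for (int i = 0; i < lookupTable.Length; i++)
{
    int v = lookupTable[i];
    // Spread the 32-bit value across the four 8-bit channels.
    packed[i] = new Color(v & 0xFF, (v >> 8) & 0xFF, (v >> 16) & 0xFF, (v >> 24) & 0xFF);
}
lut.SetData(packed);

// When drawing: sample with point filtering so neighbouring entries are never blended,
// and compute the lookup UV in the shader from the index (column = index % 1024, row = index / 1024).
GraphicsDevice.SamplerStates[1] = SamplerState.PointClamp;   // register s1 is an assumption
effect.Parameters["LookupTexture"].SetValue(lut);            // parameter name is an assumption
```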
I don't think this is possible if you want speed.
What I'm doing is rendering a number of bitmaps onto a single bitmap. There could be hundreds of images, and the bitmap being rendered to could be over 1000x1000 pixels.
I'm hoping to speed this process up by using multiple threads, but since the Bitmap object is not thread-safe it can't be rendered to concurrently. What I'm thinking is to split the large bitmap into one section per CPU, render them separately, then join them back together at the end. I haven't done this yet in case you have any better suggestions.
Any ideas? Thanks
You could use LockBits and work on individual sections of the image.
For an example of how this is done you can look at the Paint.Net source code, especially the BackgroundEffectsRenderer (yes that is a link to the mono branch, but the Paint.Net main code seems to be only available in zip files).
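If you go the LockBits route on one shared Bitmap, a rough sketch of the banding idea might look like this (assuming .NET 4+ for Parallel.For and a 32bpp image; the per-row work is left as a placeholder):

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

static void ProcessInBands(Bitmap bmp, int bandCount)
{
    Rectangle all = new Rectangle(0, 0, bmp.Width, bmp.Height);
    BitmapData data = bmp.LockBits(all, ImageLockMode.ReadWrite, PixelFormat.Format32bppArgb);
    try
    {
        int rowsPerBand = (bmp.Height + bandCount - 1) / bandCount;
        Parallel.For(0, bandCount, band =>
        {
            int firstRow = band * rowsPerBand;
            int lastRow = Math.Min(firstRow + rowsPerBand, bmp.Height);
            byte[] row = new byte[data.Stride];

            for (int y = firstRow; y < lastRow; y++)
            {
                IntPtr rowPtr = data.Scan0 + y * data.Stride;   // this band's rows only
                Marshal.Copy(rowPtr, row, 0, row.Length);
                // ... modify this row's pixels in "row" (BGRA byte order) ...
                Marshal.Copy(row, 0, rowPtr, row.Length);
            }
        });
    }
    finally
    {
        bmp.UnlockBits(data);
    }
}
```

Because each band covers a distinct range of rows, no two threads ever touch the same bytes.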
Lee, if you're going to use the GDI+ Image object, you may just end up doing all the work twice. The sections that you generate in multiple threads will need to be reassembled at the end of your divide-and-conquer approach, and wouldn't that defeat the purpose of dividing in the first place?
This is only worth it if whatever you're doing in each bitmap section is complex enough to take much more processing time than simply redrawing the image subparts onto the large bitmap without going to all that trouble.
Hope that helps. What kind of image rendering are you planning out?
You could have each thread write to a byte array, then, when they are all finished, use a single thread to create a bitmap object from the byte arrays. If all other processing has been done beforehand, that should be pretty quick.
I've done something similar; in my case I had each thread lock x rows of bits in the image (where x depended on the size of the image and the number of threads) and do its writing to those bits such that no two threads ever overlapped their writes.
One approach would be to render all the small bitmaps onto an ersatz bitmap, which would just be a two-dimensional int array (which is kind of all a Bitmap really is anyway). Once all the small bitmaps are combined in the big array, you do a one-time copy from the big array into a real Bitmap of the same dimensions.
I use this approach (not including the multi-threaded aspect) all the time for complex graphics on Windows Mobile devices, since the memory available for creating "real" GDI+ Bitmaps is severely limited.
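A rough sketch of that ersatz-bitmap idea, using a flat, row-major int[] rather than a 2D array so the final copy is a single call (the placeholder fill stands in for your real compositing, and Parallel.For assumes .NET 4+):

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

static Bitmap ComposeViaIntArray(int width, int height)
{
    // The "ersatz bitmap": one ARGB int per pixel, row-major.
    int[] pixels = new int[width * height];

    // Threads composite into their own rows; no two writes overlap.
    Parallel.For(0, height, y =>
    {
        for (int x = 0; x < width; x++)
        {
            // ... whatever compositing you need; placeholder solid fill ...
            pixels[y * width + x] = unchecked((int)0xFF3366CC);
        }
    });

    // One-time copy from the big array into a real Bitmap of the same size.
    var bmp = new Bitmap(width, height, PixelFormat.Format32bppArgb);
    var rect = new Rectangle(0, 0, width, height);
    BitmapData data = bmp.LockBits(rect, ImageLockMode.WriteOnly, PixelFormat.Format32bppArgb);
    Marshal.Copy(pixels, 0, data.Scan0, pixels.Length);
    bmp.UnlockBits(data);
    return bmp;
}
```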
You could also just use a Bitmap as you originally intended. Bitmap is not guaranteed to be thread-safe, but I'm not sure that would be a problem as long as you could ensure that no two threads ever write to the same portion of the bitmap. I'd give it a try, at least.
Update: I just re-read your question, and I realized that you're probably not going to see much (if any) improvement in the overall speed of these operations by making them multi-threaded. It's the classic nine-women-can't-make-a-baby-in-one-month problem.