Author Topic: Optimise drawing, beyond using VertexArray? (Read 3545 times)

bobble · « **on:** November 23, 2013, 11:48:54 am »

I'm trying to optimise drawing to maximise the number of objects that can be drawn using the same texture. Here's the code:

Code:

      //draw single bullet type
      sf::RenderStates states;
      states.texture = &tex[bullet_type];
      window[0].draw(&vertex_arr[0], num_to_draw*4, sf::Quads, states);

Details:

Converted from Sprites to VertexArray
Minimised primitive/texture size so there are fewer pixels to blit
vertex_arr has position and texCoords set; color is not used/altered
Texture is 32-bit, but could use a reduced format with 1-bit alpha and a small RGB colour palette

Is there a next step to speed things up? For testing I'm drawing 1,000,000 7x7 objects per frame (whether the texture is 7x7 or 8x8 makes little difference). I'm currently at ~4FPS, and the above draw call takes the majority of the time. I don't need it to be quick enough to draw 1,000,000 objects per frame. Realistically, objects will be measured in the thousands (bullet hell shmup, say 10k as a high upper ceiling), but my logic is that the quicker it can be made, the older the hardware that can be supported. As a shmup, the game uses a fixed timestep with the framerate capped at 60FPS.

binary1248 · « **Reply #1 on:** November 23, 2013, 12:27:46 pm »

If drawing a single VertexArray using a single Texture is already too slow for your purposes, the only way you can get better performance is to resort to raw OpenGL. This is pretty much where you will hit the SFML performance limit.

There are always 2 main things to consider. We assume you are currently GPU-bound (otherwise this discussion wouldn't make much sense). Depending on your graphics hardware, the GPU might have general-purpose compute units or, if you intend to support much older hardware, a fixed number of compute units per pipeline stage. With general-purpose compute units it is easier for the GPU to maintain 100% utilization, because it shifts units to whichever stage needs them most. This also makes hotspots easier to optimize, because you can easily determine where most of the GPU time is spent. With a fixed number of shaders per stage, as was the case with older hardware, one bottleneck might be preventing another from appearing in your measurements. Optimization then becomes a multi-stage process of getting all pipeline stages to 100% utilization, if that is even possible (it occurred so rarely that it was the motivation to go general purpose).

What I can say from experience in writing high-performance OpenGL code is that most of the time spent in the GPU goes to geometry processing and buffer operations. The former is governed by the amount of compute resources the GPU has available and the latter by the memory architecture/throughput (often referred to as fillrate). The goal is to minimize whatever is dominating the total draw time. To determine this, you need special tools provided by AMD/Nvidia that can read internal counters.

To reduce geometry processing load, you can try to cull what can't be seen anyway in a culling pass on the CPU before sending it to the GPU. This will raise CPU load a bit but will reduce CPU to GPU transfers (if it matters) and increase framerate if geometry processing is what the GPU spends most of its time on. There are many well-known occlusion and frustum culling algorithms; I'll leave it up to you to look for them.
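For a static 2D view, a CPU culling pass can be as simple as a bounding-box test per quad. The following is only a rough sketch: `Vec2`, `Vertex`, `Rect`, `quadIntersects` and `cullQuads` are made-up stand-ins (not SFML types or functions) so that it compiles without SFML, and it assumes four consecutive vertices per quad, as in the original post.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for sf::Vector2f / sf::Vertex / sf::FloatRect.
struct Vec2 { float x, y; };
struct Vertex { Vec2 position; Vec2 texCoords; };
struct Rect { float left, top, width, height; };

// True if the bounding box of the 4 vertices at `quad` overlaps `view`.
bool quadIntersects(const Vertex* quad, const Rect& view) {
    float minX = quad[0].position.x, maxX = minX;
    float minY = quad[0].position.y, maxY = minY;
    for (int i = 1; i < 4; ++i) {
        const Vec2& p = quad[i].position;
        if (p.x < minX) minX = p.x;
        if (p.x > maxX) maxX = p.x;
        if (p.y < minY) minY = p.y;
        if (p.y > maxY) maxY = p.y;
    }
    return maxX >= view.left && minX <= view.left + view.width &&
           maxY >= view.top  && minY <= view.top + view.height;
}

// Copies only the visible quads from `in` to `out`; returns the quad count kept.
std::size_t cullQuads(const std::vector<Vertex>& in, const Rect& view,
                      std::vector<Vertex>& out) {
    out.clear();
    for (std::size_t i = 0; i + 3 < in.size(); i += 4) {
        if (quadIntersects(&in[i], view))
            out.insert(out.end(), in.begin() + i, in.begin() + i + 4);
    }
    return out.size() / 4;
}
```

The culled array would then be passed to the draw call in place of the full one. Whether the extra CPU pass pays off depends on how many quads actually fall outside the view, so it is worth measuring.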

To minimize buffer operations, you need to reduce the amount of data written to each of the framebuffers (colour, depth, stencil, etc.) every frame. The first easy way to do this is to mask off buffers you don't intend to use anyway so they aren't written to during drawing. Since SFML doesn't use more buffers than it needs to, this is already the case. Another well known way to do this is to reduce overdraw, i.e. drawing to the same framebuffer pixel over and over again. If you draw back to front, as SFML does, there is no way the GPU can optimize this. SFML does this because this is required for trivial blending support, and because SFML doesn't want to force the user to use a depth buffer. What you will see in a typical "SFML-only" application is a very high level of overdraw, but that doesn't impact performance in a significant way for most applications. If you draw front to back, with depth buffer support, the GPU can perform what is known as "early Z-cull" and discard a fragment relatively early in the pipeline once it has determined that it will be behind something else anyway thus skipping everything thereafter including the expensive buffer operations.
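To put rough numbers on the overdraw in this thread's test case, here is a back-of-the-envelope sketch. The 800x600 window is purely an assumption (the thread never states the window size), and `fragmentsPerFrame` and `averageOverdraw` are illustrative helper names.

```cpp
#include <cstdint>

// Fragments written per frame if every object is an unclipped w x h quad.
constexpr std::uint64_t fragmentsPerFrame(std::uint64_t objects,
                                          std::uint64_t w, std::uint64_t h) {
    return objects * w * h;
}

// Average number of writes per framebuffer pixel, assuming the objects are
// spread evenly over the whole window (an idealisation).
constexpr double averageOverdraw(std::uint64_t fragments,
                                 std::uint64_t windowW, std::uint64_t windowH) {
    return static_cast<double>(fragments) /
           static_cast<double>(windowW * windowH);
}

// 1,000,000 objects of 7x7 pixels: 49,000,000 fragments per frame.
static_assert(fragmentsPerFrame(1000000, 7, 7) == 49000000,
              "49 fragments per object");
```

Under that assumed window size, the 1,000,000-object test writes each pixel about 102 times per frame on average, while the realistic 10,000-object ceiling comes out at about 1.02 writes per pixel, so the stress test exercises buffer operations far harder than the actual game would.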

There are so many more ways to optimize drawing utilizing state of the art features of the GPU, but as I said, if drawing with a VertexArray is not good enough for you, you have reached SFML's limits.

wintertime · « **Reply #2 on:** November 23, 2013, 02:57:49 pm »

Yes, sf::VertexArray is pretty much the fastest way in SFML, although you could also build your own array of sf::Vertex and pass it to the RenderWindow draw call.
What binary did not mention as something that holds performance back is that a vertex array means sending the data to the GPU each frame, and the GPU is then possibly just waiting for it. You could write a VBO implementation using OpenGL to send the data to the GPU once and then only let it do the processing each frame, but if you change much of the data each frame it may not help. You could also add an element array buffer (index buffer) with indices, as that would mean less data to transfer on a change and would allow reuse of vertices that are otherwise duplicated.
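To illustrate the index idea: when quads are drawn as triangles, each quad needs 6 vertices without indices, or 4 vertices plus 6 indices with them. A minimal sketch of generating those indices follows; `buildQuadIndices` is a made-up helper name, and it assumes each quad's 4 vertices are stored consecutively in perimeter order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds two triangles (0,1,2) and (0,2,3) for each quad of 4 vertices.
std::vector<std::uint32_t> buildQuadIndices(std::size_t quadCount) {
    std::vector<std::uint32_t> indices;
    indices.reserve(quadCount * 6);
    for (std::size_t q = 0; q < quadCount; ++q) {
        const std::uint32_t base = static_cast<std::uint32_t>(q * 4);
        const std::uint32_t tri[6] = { base, base + 1, base + 2,
                                       base, base + 2, base + 3 };
        indices.insert(indices.end(), tri, tri + 6);
    }
    return indices;
}
```

Because the indices depend only on the number of quads and not on their positions, they could be built once, uploaded to a buffer bound to `GL_ELEMENT_ARRAY_BUFFER`, and reused with `glDrawElements` while only the vertex data changes.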

bobble · « **Reply #3 on:** November 23, 2013, 04:15:00 pm »

Quote from: binary1248 on November 23, 2013, 12:27:46 pm

...
To reduce geometry processing load, you can try to cull what can't be seen anyway in a culling pass on the CPU before sending it to the GPU.
...

Off-screen culling is already done naturally: the view is static and objects are destroyed when they move offscreen. This should account for 98% of what can be culled. The bullets should be on the top layer to stop 'invisible deaths', so the only things that can obscure bullets are other bullets. There are some fringe cases that could possibly benefit from extra culling; I will investigate.

Quote from: wintertime on November 23, 2013, 02:57:49 pm

...
You could write a VBO implementation using OpenGL for sending it once to GPU and then only let it do the processing each frame, but if you would change much of the data each frame it may not help.
...

All of the position data changes each frame. Most bullets follow a predictable straight trajectory, some (not many) are axis-aligned so only x or y needs to be updated. I don't think this fulfills your criteria? Can't test it yet due to lack of skill.
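For reference, the per-frame update described here amounts to something like the sketch below. `Vec2`, `Vertex` and `advanceBullets` are hypothetical stand-ins rather than SFML API, with one velocity per quad and `dt` as the fixed timestep.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for sf::Vector2f / sf::Vertex.
struct Vec2 { float x, y; };
struct Vertex { Vec2 position; Vec2 texCoords; };

// Moves every quad (4 consecutive vertices) by its velocity * dt.
// texCoords are never touched, so only the positions change per frame.
void advanceBullets(std::vector<Vertex>& vertices,
                    const std::vector<Vec2>& velocities, float dt) {
    for (std::size_t q = 0;
         q < velocities.size() && q * 4 + 3 < vertices.size(); ++q) {
        const float dx = velocities[q].x * dt;
        const float dy = velocities[q].y * dt;
        for (std::size_t i = 0; i < 4; ++i) {
            vertices[q * 4 + i].position.x += dx;
            vertices[q * 4 + i].position.y += dy;
        }
    }
}
```

Every position is rewritten each frame while the texture coordinates stay fixed, which is the "much of the data changes each frame" case mentioned above.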

Thank you for the informative replies. Raw OpenGL is a little beyond me without being a massive rabbit-hole timesink, but when I get more familiar with the API I'll give it a go. Before I get into OpenGL I'm going to try implementing some cheats to minimise the bullet count for the engine (like a bigger bullet that looks like a cluster of smaller bullets).

wintertime · « **Reply #4 on:** November 23, 2013, 04:29:34 pm »

If it's predictable you could theoretically calculate the positions 1 frame in advance, do the buffer call to upload VBO 1, then draw VBO 2, swap, calculate and upload VBO 2, draw VBO 1, swap, ...
And maybe add a tiny extra for the few things you could not predict, or have 3 buffers to cycle, or ...
Maybe add the other buffers with indices, where you only delete/update/add those, and try to keep the VBO static.
Then profile all the methods you can think of to find out which is really faster.
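The upload/draw/swap cycle could be organised roughly like this. It is only a CPU-side sketch of the bookkeeping: the `std::vector`s stand in for the two VBOs, `PingPongBuffers` is a made-up class, and in a real implementation filling the upload target would be a call such as `glBufferSubData` and reading the draw source would be a `glDrawArrays` or `glDrawElements` call on the other buffer.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Vec2 { float x, y; };  // hypothetical stand-in for a vertex position

// Two buffers: one is filled with next frame's data while the other is drawn.
class PingPongBuffers {
public:
    // Buffer to fill this frame (it becomes the draw source after swap()).
    std::vector<Vec2>& uploadTarget() { return buffers_[uploadIndex_]; }

    // Buffer to draw this frame (filled during the previous frame).
    const std::vector<Vec2>& drawSource() const {
        return buffers_[1 - uploadIndex_];
    }

    // Exchange the roles of the two buffers at the end of the frame.
    void swap() { uploadIndex_ = 1 - uploadIndex_; }

    std::size_t uploadIndex() const { return uploadIndex_; }

private:
    std::array<std::vector<Vec2>, 2> buffers_;
    std::size_t uploadIndex_ = 0;
};
```

Each frame would then fill `uploadTarget()` with the positions predicted for the next frame, draw from `drawSource()`, and call `swap()`. Cycling 3 buffers would follow the same pattern with a modulo-3 index.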
