After running a few of the current project highlights through AMD GPU PerfStudio, I noticed something really interesting about the way Cendric causes OpenGL operations to be submitted to the driver.
Basically, for really huge vertex arrays, a disproportionately large amount of the time spent calling OpenGL is spent in the actual draw calls themselves. This is in stark contrast to where many might expect the time spent in OpenGL to go.
The time spent in those draw calls isn't actually the time the GPU needs to render anything. Commands are still queued and submitted to the GPU to be executed asynchronously. This also isn't the time that "just has to be spent" doing what any draw call would do, otherwise the draw calls that process only 4 vertices should take just as much time.
Keeping the definitions of the gl*Pointer and glDrawArrays functions in mind, the time spent almost certainly has to come from copying the vertex data from the address space of the application into the address space of the driver for asynchronous DMA transfer to the GPU.
In this case, 19600 sf::Vertex worth of data had to be copied on every draw call (this is assuming the driver is smart and combines copying the 3 memory blocks specified by the pointers into a single copy). Since each sf::Vertex is 20 bytes, this memory block is 392 KB. For reasons the author will know, it is submitted 8 times per frame for a total of 3136 KB per frame, and assuming we want to run at a minimum of 60 FPS, that comes to 188.16 MB per second.
Yes... 188.16 MB/s is still a long way from the GB/s memory bandwidth that we have within system RAM and between CPU and GPU, but still... If you consider that in this example, 50% of the CPU time the application needs per frame is spent merely copying data around it makes you wonder if there are any better alternatives.
(Again, this was all assuming the driver isn't stupid and making 3 copies per draw which would make it 564.48 MB/s)
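For anyone who wants to check the numbers, here is a minimal sketch of the arithmetic above. The Vertex struct is only a stand-in that mirrors the 20-byte layout of sf::Vertex, and the function names are made up for illustration; KB and MB are decimal units here (1 KB = 1000 bytes).

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in mirroring the layout of sf::Vertex (position, color, texture
// coordinates). Hypothetical struct for illustration, not the SFML type.
struct Vertex {
    float x, y;               // position, 8 bytes
    std::uint8_t r, g, b, a;  // color, 4 bytes
    float u, v;               // texture coordinates, 8 bytes
};
static_assert(sizeof(Vertex) == 20, "expected a 20-byte vertex");

// Bytes the driver has to copy for one draw call of a client-side array,
// assuming it combines the 3 attribute blocks into a single copy.
constexpr std::size_t bytesPerDraw(std::size_t vertexCount) {
    return vertexCount * sizeof(Vertex);
}

// Bytes copied per frame if the same array is drawn several times.
constexpr std::size_t bytesPerFrame(std::size_t vertexCount,
                                    std::size_t drawsPerFrame) {
    return bytesPerDraw(vertexCount) * drawsPerFrame;
}

// Bytes copied per second at a given frame rate.
constexpr std::size_t bytesPerSecond(std::size_t vertexCount,
                                     std::size_t drawsPerFrame,
                                     std::size_t framesPerSecond) {
    return bytesPerFrame(vertexCount, drawsPerFrame) * framesPerSecond;
}

// The figures from the measurements above.
static_assert(bytesPerDraw(19600) == 392000, "392 KB per draw");
static_assert(bytesPerFrame(19600, 8) == 3136000, "3136 KB per frame");
static_assert(bytesPerSecond(19600, 8, 60) == 188160000, "188.16 MB/s");
// Worst case: the driver makes 3 separate copies per draw.
static_assert(3 * bytesPerSecond(19600, 8, 60) == 564480000, "564.48 MB/s");
```

Everything is evaluated at compile time, so the snippet compiling at all confirms the figures.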
I wouldn't call this a bandwidth bottleneck as is. The FPS in this game is still probably bottlenecked more by some GPU-specific factors e.g. fillrate etc. This is about letting the CPU do more useful things at the same time the GPU is eating through those commands.
Which leads me to the question: Would it make sense to introduce something like sf::VertexBuffer?
sf::VertexBuffer would be something in between sf::VertexArray and sf::Texture. Just like sf::Texture, sf::VertexBuffer would live in the GPU while it is alive, and like sf::VertexArray, it would contain an array of sf::Vertex data. Unlike sf::VertexArray, reading from an sf::VertexBuffer would be just as expensive as reading from an sf::Texture since it would incur a GPU-CPU readback. However, considering it is a very common use case to only submit data without having to read it back this wouldn't be that big of a problem.
The main advantage of sf::VertexBuffer over sf::VertexArray would be that because it lives on the GPU, it won't have to be copied every draw call. Again, this isn't just about saving memory bandwidth. In all games I measured it was never a bottleneck, although one must consider I have a pretty high end system so it might become a bottleneck on crappier systems. What is guaranteed is that the drawing thread (in Cendric's case there is only a single thread) is free to do other useful things more often during a single frame instead of spending a lot of time waiting for memory to be copied. This can be things like AI, physics, sound etc. Even if the final FPS of the current games would not increase, it would leave the authors with more room to do other interesting CPU-side things that might not have been possible because the game became CPU-bound. In the case that one day games do become memory bandwidth-bound, keeping as much data in GPU memory as possible will also help to reduce the bottleneck.
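To make the difference concrete, here is a toy accounting model of the two approaches. It does not touch OpenGL or SFML at all: ClientArrayModel and BufferModel are hypothetical classes that only count how many bytes would have to cross from the application to the driver, assuming a client-side array is copied in full on every draw and a buffer is copied only when its contents are updated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in mirroring the 20-byte layout of sf::Vertex (hypothetical).
struct Vertex {
    float x, y;
    std::uint8_t r, g, b, a;
    float u, v;
};

// Toy model of a client-side vertex array: the whole array is handed to
// the driver (and assumed to be copied) on every single draw call.
class ClientArrayModel {
public:
    explicit ClientArrayModel(std::size_t count) : m_vertices(count) {}
    void draw() { m_bytesCopied += m_vertices.size() * sizeof(Vertex); }
    std::size_t bytesCopied() const { return m_bytesCopied; }
private:
    std::vector<Vertex> m_vertices;
    std::size_t m_bytesCopied = 0;
};

// Toy model of a GPU-resident buffer: data is copied once per update,
// and drawing afterwards copies no vertex data at all.
class BufferModel {
public:
    void update(const std::vector<Vertex>& vertices) {
        m_count = vertices.size();
        m_uploaded = true;
        m_bytesCopied += vertices.size() * sizeof(Vertex);
    }
    void draw() const { assert(m_uploaded); /* no vertex data copied */ }
    std::size_t vertexCount() const { return m_count; }
    std::size_t bytesCopied() const { return m_bytesCopied; }
private:
    std::size_t m_count = 0;
    bool m_uploaded = false;
    std::size_t m_bytesCopied = 0;
};

// Bytes copied when the same static geometry is drawn `draws` times.
std::size_t simulateClientArray(std::size_t count, std::size_t draws) {
    ClientArrayModel array(count);
    for (std::size_t i = 0; i < draws; ++i) array.draw();
    return array.bytesCopied();
}

std::size_t simulateBuffer(std::size_t count, std::size_t draws) {
    BufferModel buffer;
    buffer.update(std::vector<Vertex>(count));  // one upload up front
    for (std::size_t i = 0; i < draws; ++i) buffer.draw();
    return buffer.bytesCopied();
}
```

With the numbers from above (19600 vertices, 8 draws), the array model counts 3136000 bytes copied and the buffer model 392000, and the buffer figure stays the same on every following frame as long as the data isn't updated. If the vertex data changes every frame, the buffer has to be re-uploaded and most of the advantage goes away.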
So, what I want to ask is: Is there anybody out there who would actually, at the current time, benefit from keeping vertex data on the GPU using an sf::VertexBuffer? I have a feeling that Cendric would, but until there is an implementation to test with, it is just a theory.