I disagree on some of your points binary1248. But before I get into it, I just want to say thank you for your always detailed and intelligent responses. I very much appreciate it.
If after reading this, your mindset remains, "yeah, we just don't care about stock SFML performance with heavy shader use", I understand. I can deal with it on my own. But I just wanted to try to do a little more convincing before I fold my cards.
Reading out of states.shader to set the current program costs a relatively high amount of CPU cycles because of the incurred cache miss. Cache misses ironically also count towards CPU load even though the CPU doesn't actually do anything while it stalls waiting for new data.
Ok, I understand what you're saying. But based on the perf hit I saw, it looked like way more than just a cache miss. Something heavier looks like it's going on with those calls.
I think you forgot that the current program state is specific to each context... Your code would break if you rendered to 2 different sf::RenderTargets with the same shader
Yeah. I thought I addressed that when I said you'd also have to call sf::Shader::bind(NULL) whenever you switch RenderTargets.
(not to mention your static variable would have to be protected by a mutex as well in order to support multi-threaded use).
I thought about that. This was primarily why I called my solution "hacky" - just to demonstrate the general approach. It might be nice if you could set a compile flag on whether to support multi-threaded rendering, and if not, compile out these mutexes (default behaviour). I'd bet the vast majority don't use it. Anybody advanced enough to actually handle multi-threaded rendering can handle dealing with a compiler flag, in CMake or whatever.
This is the reason why the state cache is in the sf::RenderTarget itself. Following on from that, accessing the state cache anywhere outside of sf::RenderTarget would make little to no sense, meaning that this is an optimization that applies solely to applyShader. The uniform binding and everything else inside sf::Shader would be unaffected by this and you would still end up with loads of program changing per frame.
Right. So perhaps a more appropriate place to store this "pLastShaderUsed" pointer would be along with the state cache in the sf::RenderTarget?
Yes, it is an optimization that would apply solely to applyShader. Yet it appears to be a significant optimization, at least in my case.
How common is my case? I don't know. Maybe most people don't use shaders at all in SFML. But in almost all non-trivial 3D games, you generally have a shader on every mesh in the game. And quite often, it is the same shader, to handle lighting and shadows most commonly. This seems to be becoming more popular in 2D games as well, judging from new 2D games I am seeing released on steam, or devlogs on indie game sites. As well as tools popping up out there which create normal, spec, and other maps for your 2D sprites.
e.g.)
sprite dlighte.g.)
sprite illuminator... all of which require shaders on every drawable. And it is in this case that the pLastShaderUsed optimization seems to be a non-trivial improvement.
Let's say you have 100 drawables on screen. It appears to be much quicker to only bind the shader once, rather than 100 times per frame (or actually, 200 bind calls because you set it to NULL after each use).
I'm not against these kinds of optimizations per se, but I still think that the best batching/caching strategy can only be conceived by the user. It is not the point of SFML to take bad code and make good performance out of it. Taking care of application specific optimizations should be left fully to the user. Building in more and more complex caching just so that the user doesn't have to give any thought to what they are doing isn't going to make the situation better.
What about what I am doing strikes you as being bad code? If this appears to be a problem I can circumvent in my own code, I'd be happy to do it. If you have suggestions, I'm all ears. Again, the simple problem is that I reuse the same shader on lots of draw calls. I cannot batch everything up into a single large VertexArray because I have to deal with sorting of different drawables which may have different textures. I use texture atlases, but I can't fit everything I draw into one giant texture.
If your suggestion is to write my own opengl, I guess that's a possibility. But I'm still not entirely convinced that this optimization is some kind of weird edge case unique to my code. But rather it would be a useful optimization for SFML as a whole.
This is one of the reasons SFML is not and will never be a complete game engine. It provides just enough to get people started, but the real meat should still be within their own code. Things like this are so application specific and even dependant on specific circumstances in the application that building them into places as general as sf::RenderTarget just makes something that should have been simple to start with overly complicated. These optimizations always come at a cost, and that cost will always be higher than the gains for the people who actually do optimize in their own code, and the last thing we want to do is punish them for giving more effort than others at making their application run faster.
I understand your point in theory. But in practice, this appears to be something which would be completely invisible to the user. No API change. Everything under the hood. Except faster.
Perhaps we have a different idea of what SFML is. I view it as something that can fully support the low-level rendering needs of a complex 2D game,
without requiring significant changes to the graphics module. So far, that has worked out quite well - SFML has been great! Nobody is saying SFML is meant to be a game engine. We're talking rendering performance. But your feedback seems to imply that it is more meant to be base tutorial code, or something similar, where serious users are required to modify the source to get fast performance? That's not a loaded question, genuinely asking.
The actual reading out of the OpenGL extension variable only happens once...
Yeah, sorry. I get that. What I said about checking Shader::isAvailable was misleading. it's the mutex lock that's the perf problem. And that does get called potentially hundreds of times per frame. I guess I can just nuke that in my local copy since I am not multi-threading. I just wanted you guys to be aware of the perf issue on that, too.
Thanks again for your time.
(edited a couple times for clarity)