Author Topic: SFML graphics perf analysis (Read 7231 times)

Jabberwocky · « **on:** February 20, 2017, 04:29:16 pm »

Hello SFML people,

I was doing some CPU perf testing on my game. As expected, graphics-related stuff takes up quite a bit of the overall CPU usage. But I did find some interesting hot spots in SFML I wanted to discuss.

Some upfront info:

1. I am using SFML 2.3 on a new Windows 10 laptop with an NVIDIA card.

2. My game is fairly graphically intense by usual SFML standards. For example, I use a lot of shaders, I use several render textures which are updated per-frame, and I draw a lot of stuff (using VertexArrays where possible).

So,

sf::RenderTarget::draw(const Vertex* vertices, ...) 

is a hotspot, as you might expect. But what I didn't expect was the following (these are all lines of code from this function):

This line takes up about 20% of the CPU work done by RenderTarget::draw:

    if (activate(true))

... which is because of a call to WglContext::makeCurrent()
Is this something which needs to be done every time draw is called?
Perhaps with the most recent context changes to SFML 2.4 this is no longer an issue?
Or perhaps this is a symptom of the fact that I update several different RenderTextures each frame? (I make sure to batch up all the operations on a single RenderTexture before moving on to a different one.)

These lines take up over 20% of the CPU work done by RenderTarget::draw:

    applyShader(states.shader);

    applyShader(NULL);

The expensive aspects of these applyShader calls are because of:
1. Shader::isAvailable is called every time, which takes a mutex lock. This seems very wasteful for each draw call.

2. GLEXT_glUseProgramObject is called first on the shader program, then on NULL for every call. This is perhaps wasteful for a program which reuses the same shader across many draw calls. Would it be possible to cache the last used shader, and only call GLEXT_glUseProgramObject if the shader has changed?

This line takes up most of the remaining CPU (~55%), which I would expect:

    glCheck(glDrawArrays(mode, 0, vertexCount));

Thanks for any thoughts you have to share.

Laurent · « **Reply #1 on:** February 20, 2017, 05:31:59 pm »

Quote

Perhaps with the most recent context changes to SFML 2.4 this is no longer an issue?

Probably (why don't you test the latest version?), but binary1248 will be able to tell you more about this subject.

Quote

Would it be possible to cache the last used shader, and only call GLEXT_glUseProgramObject if the shader has changed?

Caching shader objects is possible, but much harder than textures or states.

binary1248 · « **Reply #2 on:** February 20, 2017, 07:37:06 pm »

Quote from: Jabberwocky on February 20, 2017, 04:29:16 pm

This line takes up about 20% of the CPU work done by RenderTarget::Draw:
if (activate(true))
... which is because of a call to WglContext::makeCurrent()
Is this something which needs to be done every time draw is called?

The call itself doesn't actually take up that much if any CPU, it's the side-effects that the API dictates that cause synchronization. See here:

Quote

Before switching to the new rendering context, OpenGL flushes any previous rendering context that was current to the calling thread.

If you loaded up the CPU side of the driver with commands, switching away from a "full" context will always cause a flush to the GPU, meaning that you are forcing the driver to finally do some of the work it had been piling up for a while.

Quote from: Jabberwocky on February 20, 2017, 04:29:16 pm

Perhaps with the most recent context changes to SFML 2.4 this is no longer an issue?
Or perhaps this is a symptom of the fact I update several different RenderTextures each frame? (I ensure to batch up all the operations on a single RenderTexture before moving on to a different one.)

It depends on the application whether there will be more or less switching between contexts, but as I already said, it's not the switching that matters but the actual work you queued up. At some point it is going to have to be done. It's just a matter of whether you are measuring the CPU usage when it actually happens.
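To make the "it's a matter of when you measure" point concrete, here is a toy model. It is only a sketch and not how SFML or any real driver is implemented: `ToyContext`, `submit`, `flush` and `workAttributedToFlush` are made-up names. Commands are merely recorded when issued and only executed at the flush, so a profiler would attribute all of the work to whichever call happens to trigger the flush.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy model of a deferred command queue. All names are hypothetical;
// this is not how SFML or any driver is actually implemented.
class ToyContext
{
public:
    // "Drawing" only records the work; nothing is executed yet.
    void submit(std::function<void()> command)
    {
        m_pending.push_back(std::move(command));
    }

    // Switching away from the context forces the queued work to run.
    // Returns how many commands were executed by this flush.
    std::size_t flush()
    {
        const std::size_t executed = m_pending.size();
        for (auto& command : m_pending)
            command();
        m_pending.clear();
        return executed;
    }

    std::size_t pendingCount() const { return m_pending.size(); }

private:
    std::vector<std::function<void()>> m_pending;
};

// Queues `drawCalls` cheap commands, then flushes once.
// Returns the number of commands whose cost landed in the flush.
std::size_t workAttributedToFlush(std::size_t drawCalls)
{
    ToyContext context;
    std::size_t workDone = 0;
    for (std::size_t i = 0; i < drawCalls; ++i)
        context.submit([&workDone] { ++workDone; });

    assert(workDone == 0); // nothing has run before the flush
    const std::size_t executed = context.flush();
    assert(workDone == executed);
    return executed;
}
```

In this model, 100 submitted commands cost nothing at submit time and all 100 are paid for inside `flush()`, which is roughly the effect that makes the context switch look expensive in a CPU profile.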

Quote from: Jabberwocky on February 20, 2017, 04:29:16 pm

These lines take up over 20% of the CPU work done by RenderTarget::Draw:
applyShader(states.shader)
applyShader(NULL);

This is mostly due to really horrible CPU cache usage when trying to be smart about caching GL states. Having to hop around memory a lot isn't a fun thing to do.

Quote from: Jabberwocky on February 20, 2017, 04:29:16 pm

The expensive aspects of these applyShader calls are because of:
1. Shader::isAvailable is called every time, which takes a mutex lock. This seems very wasteful for each draw call.

Trust me... if you think this is bad, you don't want to know what it looks like inside the driver itself.

Quote from: Jabberwocky on February 20, 2017, 04:29:16 pm

2. GLEXT_glUseProgramObject is called first on the shader program, then on NULL for every call. This is perhaps wasteful for a program which reuses the same shader across many draw calls. Would it be possible to cache the last used shader, and only call GLEXT_glUseProgramObject if the shader has changed?

Yes... this is rather suboptimal, but it is also what it has to look like if you don't want to utterly break compatibility with older behaviour or, as Laurent said, want to keep the rendering rather simple. If you use the same program across multiple draws, you might want to consider batching them together yourself anyway.
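As a rough sketch of what "batching them together yourself" could look like on the application side: group the draw commands by shader (then texture) before submitting them. `DrawCommand`, `countShaderSwitches` and `sortByState` are hypothetical names, the integer IDs merely stand in for shader and texture objects, and the reordering is only valid when draw order does not matter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical draw command: the IDs stand in for shader and texture objects.
struct DrawCommand
{
    int shaderId;  // assumed to be >= 0
    int textureId;
};

// Counts how often the shader changes while walking the list in order.
std::size_t countShaderSwitches(const std::vector<DrawCommand>& commands)
{
    std::size_t switches = 0;
    int current = -1; // -1 means "no shader bound yet"
    for (const DrawCommand& command : commands)
    {
        assert(command.shaderId >= 0);
        if (command.shaderId != current)
        {
            current = command.shaderId;
            ++switches;
        }
    }
    return switches;
}

// Groups commands by shader, then by texture.
// Only valid when the draw order of the commands does not matter.
void sortByState(std::vector<DrawCommand>& commands)
{
    std::stable_sort(commands.begin(), commands.end(),
        [](const DrawCommand& a, const DrawCommand& b)
        {
            if (a.shaderId != b.shaderId)
                return a.shaderId < b.shaderId;
            return a.textureId < b.textureId;
        });
}
```

For a list that alternates between two shaders, sorting first reduces the number of shader switches to two, regardless of how many commands there are.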

Quote from: Jabberwocky on February 20, 2017, 04:29:16 pm

This line takes up most of the remaining CPU (~55%), which I would expect:
glCheck(glDrawArrays(mode, 0, vertexCount));

Due to the really asynchronous nature of OpenGL, I wouldn't really draw these kinds of conclusions too prematurely...

Jabberwocky · « **Reply #3 on:** February 20, 2017, 09:34:16 pm »

Quote from: binary1248 on February 20, 2017, 07:37:06 pm

If you loaded up the CPU side of the driver with commands, switching away from a "full" context will always cause a flush to the GPU, meaning that you are forcing the driver to finally do some of the work it had been piling up for a while.

Gotcha.
Fully understood.

Quote from: binary1248 on February 20, 2017, 07:37:06 pm

Quote from: Jabberwocky on February 20, 2017, 04:29:16 pm
These lines take up over 20% of the CPU work done by RenderTarget::Draw:
applyShader(states.shader)
applyShader(NULL);
This is mostly due to really horrible CPU cache usage when trying to be smart about caching GL states. Having to hop around memory a lot isn't a fun thing to do.

I'm not sure I understand you here.

I tried hacking in a fairly simple fix for these unnecessary calls to GLEXT_glUseProgramObject.
Here's the general idea:

    void Shader::bind(const Shader* shader)
    {
        static const sf::Shader* pLastUsedShader = nullptr;
        bool bNewShader = false;
        if (pLastUsedShader != shader)
        {
            bNewShader = true;
            pLastUsedShader = shader;
        }

        // Only call GLEXT_glUseProgramObject if we're actually changing shaders.
        // If we're using the same shader, that would be a wasteful perf drain.
        if (bNewShader)
        {
            if (shader && shader->m_shaderProgram)
            {
                // Enable the program
                glCheck(GLEXT_glUseProgramObject(castToGlHandle(shader->m_shaderProgram)));
            }
            else
            {
                // Bind no shader
                glCheck(GLEXT_glUseProgramObject(0));
            }
        }

        if (shader && shader->m_shaderProgram)
        {
            // Bind the textures
            shader->bindTextures();

            // Bind the current texture
            if (shader->m_currentTexture != -1)
                glCheck(GLEXT_glUniform1i(shader->m_currentTexture, 0));
        }
    }

... although some other minor changes were also needed. You need to call sf::Shader::bind(NULL) whenever you activate a new RenderTarget. And in RenderTarget::draw, you call applyShader(states.shader) even if it is NULL, to make sure any previously set shader is removed.

Is there anything particularly wrong with this approach? It seems to work fine in my game.

Quote from: binary1248 on February 20, 2017, 07:37:06 pm

Quote from: Jabberwocky on February 20, 2017, 04:29:16 pm
The expensive aspects of these applyShader calls are because of:
1. Shader::isAvailable is called every time, which takes a mutex lock. This seems very wasteful for each draw call.
Trust me... if you think this is bad, you don't want to know what it looks like inside the driver itself.

Sure. But if it's an unnecessary drain on CPU perf, why do it? I mean, shouldn't we only have to check if shaders are supported once? When the program starts? And not every draw call?

binary1248 · « **Reply #4 on:** February 21, 2017, 01:36:23 am »

Quote from: Jabberwocky on February 20, 2017, 09:34:16 pm

I'm not sure I understand you here.

Reading out of states.shader to set the current program costs a relatively high amount of CPU cycles because of the incurred cache miss. Cache misses ironically also count towards CPU load even though the CPU doesn't actually do anything while it stalls waiting for new data.

Quote from: Jabberwocky on February 20, 2017, 09:34:16 pm

I tried hacking in a fairly simple fix for these unnecessary calls to GLEXT_glUseProgramObject.
... although some other minor changes were also needed. You need to call sf::Shader::bind(NULL) whenever you activate a new RenderTarget. And in RenderTarget::Draw, you call applyShader(states.shader) even if it is NULL, to make sure to remove any previously set shader.

Is there anything particularly wrong with this approach? It seems to work fine in my game.

I think you forgot that the current program state is specific to each context... Your code would break if you rendered to 2 different sf::RenderTargets with the same shader (not to mention your static variable would have to be protected by a mutex as well in order to support multi-threaded use). This is the reason why the state cache is in the sf::RenderTarget itself. Following on from that, accessing the state cache anywhere outside of sf::RenderTarget would make little to no sense, meaning that this is an optimization that applies solely to applyShader. The uniform binding and everything else inside sf::Shader would be unaffected by this and you would still end up with loads of program changing per frame.

I'm not against these kinds of optimizations per se, but I still think that the best batching/caching strategy can only be conceived by the user. It is not the point of SFML to take bad code and make good performance out of it. Taking care of application specific optimizations should be left fully to the user. Building in more and more complex caching just so that the user doesn't have to give any thought to what they are doing isn't going to make the situation better. This is one of the reasons SFML is not and will never be a complete game engine. It provides just enough to get people started, but the real meat should still be within their own code. Things like this are so application specific and even dependent on specific circumstances in the application that building them into places as general as sf::RenderTarget just makes something that should have been simple to start with overly complicated. These optimizations always come at a cost, and that cost will always be higher than the gains for the people who actually do optimize in their own code, and the last thing we want to do is punish them for giving more effort than others at making their application run faster.

Quote from: Jabberwocky on February 20, 2017, 09:34:16 pm

Sure. But if it's an unnecessary drain on CPU perf, why do it? I mean, shouldn't we only have to check if shaders are supported once? When the program starts? And not every draw call?

The actual reading out of the OpenGL extension variable only happens once... This is already a bit shaky, because it isn't guaranteed anywhere in the specification that multiple distinct contexts have to support the exact same extensions, but experience has shown that this is mostly true. If we had the luxury of C++11 synchronized static initialization (even using lambdas if you want to get really fancy), the synchronization wouldn't be an issue either. That whole function call just becomes a lock and read from 2 bools. Fact is, C++98 doesn't guarantee that 2 threads that simultaneously call sf::Shader::isAvailable() are going to behave without explicit synchronization. Yeah... sucks for those who still use or have to support C++98, but we are in 2017, there are better alternatives now, and we here at SFML know this as well.
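For reference, this is roughly what the C++11 variant mentioned above could look like. It is a minimal sketch, not SFML's actual code: `probeShaderSupport`, `shadersAvailable`, `queryFromThreads` and `g_probeCount` are made-up names, and the probe simply returns true instead of querying OpenGL. The language guarantees that the function-local static is initialized exactly once, even when several threads arrive at the same time, so no explicit mutex is needed in user code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Counts how many times the (hypothetical) expensive probe runs.
std::atomic<int> g_probeCount{0};

// Stand-in for reading the OpenGL extension information.
bool probeShaderSupport()
{
    ++g_probeCount;
    return true;
}

// C++11 guarantees that a function-local static is initialized exactly once,
// even if several threads reach this line at the same time.
bool shadersAvailable()
{
    static const bool available = [] { return probeShaderSupport(); }();
    return available;
}

// Calls shadersAvailable() from several threads at once.
// Returns true if every thread saw "available".
bool queryFromThreads(int threadCount)
{
    std::atomic<int> yesCount{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; ++i)
        threads.emplace_back([&yesCount] { if (shadersAvailable()) ++yesCount; });
    for (std::thread& thread : threads)
        thread.join();
    return yesCount == threadCount;
}
```

However many threads call `shadersAvailable()`, the probe runs once; later calls only pay for whatever initialization guard check the compiler emits.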

Jabberwocky · « **Reply #5 on:** February 21, 2017, 02:47:01 am »

I disagree on some of your points binary1248. But before I get into it, I just want to say thank you for your always detailed and intelligent responses. I very much appreciate it.

If after reading this, your mindset remains, "yeah, we just don't care about stock SFML performance with heavy shader use", I understand. I can deal with it on my own. But I just wanted to try to do a little more convincing before I fold my cards.

Quote from: binary1248 on February 21, 2017, 01:36:23 am

Reading out of states.shader to set the current program costs a relatively high amount of CPU cycles because of the incurred cache miss. Cache misses ironically also count towards CPU load even though the CPU doesn't actually do anything while it stalls waiting for new data.

Ok, I understand what you're saying. But based on the perf hit I saw, it looked like way more than just a cache miss. Something heavier looks like it's going on with those calls.

Quote from: binary1248 on February 21, 2017, 01:36:23 am

I think you forgot that the current program state is specific to each context... Your code would break if you rendered to 2 different sf::RenderTargets with the same shader

Yeah. I thought I addressed that when I said you'd also have to call sf::Shader::bind(NULL) whenever you switch RenderTargets.

Quote from: binary1248 on February 21, 2017, 01:36:23 am

(not to mention your static variable would have to be protected by a mutex as well in order to support multi-threaded use).

I thought about that. This was primarily why I called my solution "hacky" - just to demonstrate the general approach. It might be nice if you could set a compile flag on whether to support multi-threaded rendering, and if not, compile out these mutexes (default behaviour). I'd bet the vast majority don't use it. Anybody advanced enough to actually handle multi-threaded rendering can handle dealing with a compiler flag, in CMake or whatever.
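Something like this is what I have in mind, as a sketch only. `DEMO_THREAD_SAFE`, `DemoMutex`, `g_stateMutex`, `g_lastProgram` and `setLastProgram` are all made up for illustration; SFML has no such option as far as I know. With the flag off, the lock compiles down to an empty type.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical build option; SFML does not define this macro.
// Build with -DDEMO_THREAD_SAFE=1 to get real locking.
#ifndef DEMO_THREAD_SAFE
#define DEMO_THREAD_SAFE 0
#endif

#if DEMO_THREAD_SAFE
using DemoMutex = std::mutex;
#else
// Same lock()/unlock() interface as std::mutex, but does nothing.
struct DemoMutex
{
    void lock() {}
    void unlock() {}
};
#endif

DemoMutex g_stateMutex;
int g_lastProgram = 0; // 0 means "no program bound"

// Records the program id and returns true if it differs from the last one.
bool setLastProgram(int program)
{
    // A no-op unless DEMO_THREAD_SAFE is enabled.
    std::lock_guard<DemoMutex> lock(g_stateMutex);
    const bool changed = (program != g_lastProgram);
    g_lastProgram = program;
    return changed;
}
```

The calling code is identical in both configurations; only the type behind `DemoMutex` changes.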

Quote from: binary1248 on February 21, 2017, 01:36:23 am

This is the reason why the state cache is in the sf::RenderTarget itself. Following on from that, accessing the state cache anywhere outside of sf::RenderTarget would make little to no sense, meaning that this is an optimization that applies solely to applyShader. The uniform binding and everything else inside sf::Shader would be unaffected by this and you would still end up with loads of program changing per frame.

Right. So perhaps a more appropriate place to store this "pLastShaderUsed" pointer would be along with the state cache in the sf::RenderTarget?

Yes, it is an optimization that would apply solely to applyShader. Yet it appears to be a significant optimization, at least in my case.

How common is my case? I don't know. Maybe most people don't use shaders at all in SFML. But in almost all non-trivial 3D games, you generally have a shader on every mesh in the game. And quite often, it is the same shader, to handle lighting and shadows most commonly. This seems to be becoming more popular in 2D games as well, judging from new 2D games I am seeing released on Steam, or devlogs on indie game sites. As well as tools popping up out there which create normal, spec, and other maps for your 2D sprites.

e.g. Sprite DLight
e.g. Sprite Illuminator

... all of which require shaders on every drawable. And it is in this case that the pLastShaderUsed optimization seems to be a non-trivial improvement.

Let's say you have 100 drawables on screen. It appears to be much quicker to only bind the shader once, rather than 100 times per frame (or actually, 200 bind calls because you set it to NULL after each use).
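To put numbers on that, here is a toy count. It is a sketch only: `ToyTarget` and `bindCallsFor` are made-up names, `useProgram` just counts calls in place of GLEXT_glUseProgramObject, and I'm assuming each target has its own context, so the "last program" is remembered per target rather than in a static.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a render target that owns its own state cache.
class ToyTarget
{
public:
    // Issues one draw call with `program`; 0 means "no shader".
    void draw(unsigned int program, bool cacheEnabled)
    {
        if (!cacheEnabled)
        {
            useProgram(program); // bind before the draw
            useProgram(0);       // unbind after the draw
            return;
        }

        // Cached path: only touch the program when it actually changes.
        if (program != m_lastProgram)
            useProgram(program);
    }

    std::size_t bindCalls() const { return m_bindCalls; }

private:
    // Plays the role of GLEXT_glUseProgramObject; it only counts calls.
    void useProgram(unsigned int program)
    {
        m_lastProgram = program;
        ++m_bindCalls;
    }

    unsigned int m_lastProgram = 0; // per-target, not static
    std::size_t m_bindCalls = 0;
};

// Draws `drawables` objects that all use the same shader on one target
// and returns how many bind calls that caused.
std::size_t bindCallsFor(std::size_t drawables, bool cacheEnabled)
{
    ToyTarget target;
    for (std::size_t i = 0; i < drawables; ++i)
        target.draw(1, cacheEnabled);
    return target.bindCalls();
}
```

With 100 drawables this gives 200 bind calls without the cache and 1 with it. Because the cache lives in the target, two targets drawing with the same shader each still bind it once.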

Quote from: binary1248 on February 21, 2017, 01:36:23 am

I'm not against these kinds of optimizations per se, but I still think that the best batching/caching strategy can only be conceived by the user. It is not the point of SFML to take bad code and make good performance out of it. Taking care of application specific optimizations should be left fully to the user. Building in more and more complex caching just so that the user doesn't have to give any thought to what they are doing isn't going to make the situation better.

What about what I am doing strikes you as being bad code? If this appears to be a problem I can circumvent in my own code, I'd be happy to do it. If you have suggestions, I'm all ears. Again, the simple problem is that I reuse the same shader on lots of draw calls. I cannot batch everything up into a single large VertexArray because I have to deal with sorting of different drawables which may have different textures. I use texture atlases, but I can't fit everything I draw into one giant texture.

If your suggestion is to write my own OpenGL, I guess that's a possibility. But I'm still not convinced that this optimization is some weird edge case unique to my code; rather, it seems like a useful optimization for SFML as a whole.

Quote from: binary1248 on February 21, 2017, 01:36:23 am

This is one of the reasons SFML is not and will never be a complete game engine. It provides just enough to get people started, but the real meat should still be within their own code. Things like this are so application specific and even dependent on specific circumstances in the application that building them into places as general as sf::RenderTarget just makes something that should have been simple to start with overly complicated. These optimizations always come at a cost, and that cost will always be higher than the gains for the people who actually do optimize in their own code, and the last thing we want to do is punish them for giving more effort than others at making their application run faster.

I understand your point in theory. But in practice, this appears to be something which would be completely invisible to the user. No API change. Everything under the hood. Except faster.

Perhaps we have a different idea of what SFML is. I view it as something that can fully support the low-level rendering needs of a complex 2D game, without requiring significant changes to the graphics module. So far, that has worked out quite well - SFML has been great! Nobody is saying SFML is meant to be a game engine. We're talking rendering performance. But your feedback seems to imply that it is more meant to be base tutorial code, or something similar, where serious users are required to modify the source to get fast performance? That's not a loaded question, genuinely asking.

Quote from: binary1248 on February 21, 2017, 01:36:23 am

The actual reading out of the OpenGL extension variable only happens once...

Yeah, sorry. I get that. What I said about checking Shader::isAvailable was misleading. It's the mutex lock that's the perf problem. And that does get called potentially hundreds of times per frame. I guess I can just nuke that in my local copy since I am not multi-threading. I just wanted you guys to be aware of the perf issue on that, too.

Thanks again for your time.

(edited a couple times for clarity)

binary1248 · « **Reply #6 on:** February 21, 2017, 05:37:37 pm »

Quote from: Jabberwocky on February 21, 2017, 02:47:01 am

Ok, I understand what you're saying. But based on the perf hit I saw, it looked like way more than just a cache miss. Something heavier looks like it's going on with those calls.

All the stuff inside sf::Shader::bind() is going on... It's not nothing but it's also not the world. In my profiling, I've seen a higher than average amount of time spent in there, this is true, but nothing so disproportionately high that it just screams for some kind of optimization.

Quote from: Jabberwocky on February 21, 2017, 02:47:01 am

How common is my case? I don't know. Maybe most people don't use shaders at all in SFML. But in almost all non-trivial 3D games, you generally have a shader on every mesh in the game. And quite often, it is the same shader, to handle lighting and shadows most commonly. This seems to be becoming more popular in 2D games as well, judging from new 2D games I am seeing released on Steam, or devlogs on indie game sites. As well as tools popping up out there which create normal, spec, and other maps for your 2D sprites.

I know that shaders are becoming more and more popular over time, and I'm not trying to deny that making sure they are worth using in SFML is important. I just want to make sure everybody is aware that the SFML code base started out quite a while ago, and the current drawable API design still has strings attached to its previous history. If Laurent were to design it from scratch, it would probably look different from what it is, and also be faster and easier to optimize than it is now. You really don't have to tell me that the way SFML renders is "sub-optimal", to put it nicely. I was one of the first ones to state this opinion publicly, even before I was part of the team. If you look at the SFGUI renderers, I did everything I could to minimize state changes. I, just like you, hope that there will be a fresh wind when SFML 3 comes around, and you don't have to worry: this time performance will be taken into account from the get-go.

Quote from: Jabberwocky on February 21, 2017, 02:47:01 am

What about what I am doing strikes you as being bad code? If this appears to be a problem I can circumvent in my own code, I'd be happy to do it. If you have suggestions, I'm all ears.

I just stated that as an example of something some people might expect. If you look at the way some games "out there" are developed, one can't help but get these pictures of people who really think engines/libraries can work miracles without having to invest any effort themselves.

Quote from: Jabberwocky on February 21, 2017, 02:47:01 am

Again, the simple problem is that I reuse the same shader on lots of draw calls. I cannot batch everything up into a single large VertexArray because I have to deal with sorting of different drawables which may have different textures. I use texture atlases, but I can't fit everything I draw into one giant texture.

If your suggestion is to write my own OpenGL, I guess that's a possibility. But I'm still not convinced that this optimization is some weird edge case unique to my code; rather, it seems like a useful optimization for SFML as a whole.

I can really only give you the same answers people are given when they ask if they should be using Vulkan instead of OpenGL. Using OpenGL isn't a bad thing in all cases. Experts even recommend to just keep using it if you feel comfortable using it and you can meet your requirements. Once you start getting into difficulties, you can start to employ some well known patterns/tricks and see how much they help. At some point, ultimately the tricks also don't help any more and you will have to rise to the "next level". For those people, that next level is Vulkan, for you it is OpenGL.
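One example of such a well-known trick, sketched under the constraints from the quoted post (draw order has to be kept and there are several textures): merge consecutive quads that share a texture into one batch, so each run becomes a single draw call. `Quad`, `Batch` and `buildBatches` are hypothetical names, and the integer IDs stand in for texture objects.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical quad: only the fields needed for batching are shown.
struct Quad
{
    int textureId;
    float depth;
};

// A run of consecutive quads that can be submitted as one draw call.
struct Batch
{
    int textureId;
    std::size_t first; // index of the first quad in the run
    std::size_t count; // number of quads in the run
};

// `sortedQuads` must already be in the required draw order (e.g. back to front).
// Consecutive quads sharing a texture are merged; the order is never changed.
std::vector<Batch> buildBatches(const std::vector<Quad>& sortedQuads)
{
    std::vector<Batch> batches;
    for (std::size_t i = 0; i < sortedQuads.size(); ++i)
    {
        const int textureId = sortedQuads[i].textureId;
        if (!batches.empty() && batches.back().textureId == textureId)
            ++batches.back().count;                    // extend the current run
        else
            batches.push_back(Batch{textureId, i, 1}); // start a new run
    }
    return batches;
}
```

Six quads using textures 1, 1, 2, 2, 2, 1 in that order end up as three batches instead of six draws. How much this helps depends entirely on how often the texture changes along the sorted order, which is where texture atlases come in.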

Quote from: Jabberwocky on February 21, 2017, 02:47:01 am

I understand your point in theory. But in practice, this appears to be something which would be completely invisible to the user. No API change. Everything under the hood. Except faster.

Perhaps we have a different idea of what SFML is. I view it as something that can fully support the low-level rendering needs of a complex 2D game, without requiring significant changes to the graphics module. So far, that has worked out quite well - SFML has been great! Nobody is saying SFML is meant to be a game engine. We're talking rendering performance. But your feedback seems to imply that it is more meant to be base tutorial code, or something similar, where serious users are required to modify the source to get fast performance? That's not a loaded question, genuinely asking.

I never expect anybody to have to modify SFML source to be able to use it in some productive way. If something has to be modified then it should be either to work around a fresh bug, or implement a feature that hasn't made it into master yet. SFML also shouldn't be treated as a glorified tutorial. It's just another tool in your toolbox. You have to decide for yourself which tool fits the task at hand the best. We don't have anything against people using OpenGL for rendering instead of the drawable API, that is the whole reason why interoperation is supported in the first place. All we ask is that people who prefer using one over the other state their reasoning so that we can consider it when making future API/implementation decisions. SFML is an open source, zlib-licensed library. We don't owe anyone anything and neither do they owe us anything. It's nice when people can make the most out of what they are given, and even nicer when they contribute back to make future versions of the library even better at it. We know that SFML could be much more, and hopefully it will become something more in the future, but for now, we all have to make the best out of what we have.

Jabberwocky · « **Reply #7 on:** February 21, 2017, 06:51:02 pm »

Ok.

I definitely cast my vote (for whatever that's worth) towards performance being a core consideration with SFML 3. Although I understand that is likely a long way off.

I appreciate your insights into some of the performance issues I've encountered, binary1248. I can probably knock off some low hanging fruit on my own.
