Author Topic: Get rid of the state cache inside RenderTarget (Read 13913 times)

binary1248 · « **on:** December 04, 2015, 11:26:23 pm »

I never really liked the idea of this state cache. It makes interoperation between sfml-graphics and other libraries painful, and just complicates RenderTarget code.

I did some testing, and it seems that whether or not the cache is used, the performance stays identical, in both high and low graphics load situations. On first glance, it would seem like saving 1 matrix upload and having to manipulate a few more GL states every frame might have some form of impact. The fact that it doesn't, seems to me like an indication that we just move the work from the calls into GL to constantly having to pre-transform the vertex data on the CPU instead of leaving it to the shader units which are more suited for the job anyway.

Finally getting rid of the cache would make current maintenance of RenderTarget and future extension much easier. I'm already having a hard time figuring out how to support custom shader attributes with this cache around, and it also seems to have an effect on #1015 as well.

For reference, this is the code I used to generate a synthetic load that would actually trigger the state caching:

#include <SFML/Graphics.hpp>
#include <algorithm>
#include <sstream>

int main()
{
    sf::RenderWindow window(sf::VideoMode(800, 600, 32), "Test");

    sf::Clock clock;
    sf::Event event;
    std::stringstream sstr;

    sf::Texture texture;
    texture.loadFromFile("resources/texture.jpg");
    sf::Sprite sprites[4096];

    for (int i = 0; i < 4096; i++)
        sprites[i].setTexture(texture);

    float deltas[256];
    int index = 0;

    while (window.isOpen())
    {
        deltas[index = ((index + 1) % 256)] = clock.restart().asSeconds();

        if (!index)
        {
            float avg = std::accumulate(deltas, deltas + 256, 0.f) / 256.f;
            sstr.str("");
            sstr << "Frame Time: " << avg << "s -- " << (1.f / avg) << " FPS";
            window.setTitle(sstr.str());
        }

        while (window.pollEvent(event))
        {
            if (event.type == sf::Event::Closed)
                return EXIT_SUCCESS;
        }

        window.clear();

        for (int i = 0; i < 4096; i++)
            window.draw(sprites[i]);

        window.display();
    }
}

You can test the performance with and without caching enabled by temporarily hardcoding this line to false.

Laurent · « **Reply #1 on:** December 05, 2015, 12:59:07 pm »

Are you talking about the vertex cache only, or everything (view, blend mode, texture, shader)?

binary1248 · « **Reply #2 on:** December 05, 2015, 02:59:45 pm »

Everything. Most of the "time saved" (not really) is supposed to come from the vertex cache, so I mentioned disabling that. The same applies if you disable everything else as well. For me the performance stays exactly identical (in this synthetic load situation) whether they are on or off.

What you also have to consider is that when people decide to optimize themselves by e.g. using VertexArrays, the vertex cache checks would no longer need to be performed, which means a potential improvement for them. Right now, this vertex cache "optimization" is mostly targeted towards those who render primarily using quads. I think it is safe to say that more advanced users know that there is a lot of potential if they batch themselves into VertexArrays, so this optimization doesn't even affect the right audience if you ask me.

hobby · « **Reply #3 on:** December 05, 2015, 07:50:58 pm »

Quote

Quote
Are you talking about the vertex cache only, or everything (view, blend mode, texture, shader)?
Everything.

Do you suggest to bind the texture every draw call (i.e. comment out this line)?
The recommendation to minimize state changes can be found not only in official guides of OpenGL 1.1 (which SFML fortunately still supports) but also on more contemporary ones (Apple, ARM).

Quote

I think it is safe to say that more advanced users know that there is a lot of potential if they batch themselves into VertexArrays, so this optimization doesn't even affect the right audience if you ask me.

Due to various considerations (e.g. undocumented limit for # of vertices, application data structures that need to be contiguous in memory, etc.), even when batching is used there may be multiple draw() calls sharing the same state. And even if not completely, perhaps partially.

Quote

Finally getting rid of the cache would make current maintenance of RenderTarget and future extension much easier. I'm already having a hard time figuring out how to support custom shader attributes with this cache around, and it also seems to have an effect on #1015 as well.

As a user, I would prefer SFML to stay on the safe side with regard to best practices for performance, just like it does so well for C++. I'm not familiar with the new feature you're implementing or the overall maintenance efforts of the Graphics library, but I can say that I found RenderTarget quite elegant in its design when I implemented #1015.

binary1248 · « **Reply #4 on:** December 05, 2015, 10:18:10 pm »

Quote from: hobby on December 05, 2015, 07:50:58 pm

Do you suggest to bind the texture every draw call (i.e. comment out this line)?
The recommendation to minimize state changes can be found not only in official guides of OpenGL 1.1 (which SFML fortunately still supports) but also on more contemporary ones (Apple, ARM).

I know that changing state often is a bad thing, but what I also know is that changing client state has much less of an impact than changing server state. Anything that has to do with client-side arrays is purely client side i.e. is performed in the driver. Whether we choose to do a bit of caching ourselves or leave it up to the driver process is just a matter of taste. In the end, it is executed by the same CPU anyway, which is why there is no difference. Moving this responsibility over to the driver will let it optimize much better if it sees any opportunities. Not only that, but moving it out of RenderTarget will also make room for newer RenderTarget implementations (using newer OpenGL), which will result in much better performance anyway. There is no point talking about optimizing that last 1% when we know that OpenGL 1.1 is far from the optimal way of doing things anyway.

Quote from: hobby on December 05, 2015, 07:50:58 pm

Due to various considerations (e.g. undocumented limit for # of vertices, application data structures that need to be contiguous in memory, etc.), even when batching is used there may be multiple draw() calls sharing the same state. And even if not completely, perhaps partially.

I'm not saying that it is always possible to draw your whole scene in a single draw call, but the people who count on the current optimization are literally doing nothing themselves to make the situation any better. Cutting down the number of batches will have a greater (and more noticeable) impact than having to enable and disable a few states more than necessary. You should normally start to optimize where it has a greater impact, and not where you don't have to think about it because it is taken care of by someone else.

If you want to see how far batching can go, you can go look at the SFGUI renderers. If you look at the SFGUI test application, there are well over a thousand primitives that get rendered every frame, and that using 3 draw calls. People who say batching isn't possible at all in their application without even trying either have no idea what a batch is, or are just too lazy to optimize performance themselves.

Quote from: hobby on December 05, 2015, 07:50:58 pm

As a user, I would prefer SFML to stay on the safe side with regard to best practices for performance, just like it does so well for C++. I'm not familiar with the new feature you're implementing or the overall maintenance efforts of the Graphics library, but I can say that I found RenderTarget quite elegant in its design when I implemented #1015.

Like I said above, this "best practice" is written more with newer OpenGL in mind, and also has more noticeable effects when applied to newer OpenGL. I'm not a friend of going to great depths just to follow "best practice" even though it doesn't really benefit us and even hampers our efforts to maintain/extend the implementation. Getting rid of the state cache is purely a removal, and I don't know about you, but to me removing code which does not do what it was intended to do is an improvement. Encapsulation even had to be broken in Texture because the state cache had to track which texture was last active, already a minus point.

Laurent · « **Reply #5 on:** December 06, 2015, 01:32:44 pm »

Feel free to test new things, but all I can say is that every single optimization written there has been tested and proved (at the time I did it) to improve performances a lot.

hobby · « **Reply #6 on:** December 06, 2015, 09:14:52 pm »

Quote from: binary1248 on December 05, 2015, 10:18:10 pm

If you want to see how far batching can go, you can go look at the SFGUI renderers. If you look at the SFGUI test application, there are well over a thousand primitives that get rendered every frame, and that using 3 draw calls. People who say batching isn't possible at all in their application without even trying either have no idea what a batch is, or are just too lazy to optimize performance themselves.

No doubt you've gone to great lengths to optimize rendering in SFGUI. I've noticed that you call OpenGL directly, while I draw exclusively using SFML.

If I used glDrawElements() like SFGUI, I wouldn't be able to use SFML for drawing. I believe that my use case is very similar to GUIs, so perhaps I can draw from your experience? (sorry if it's a bit off-topic.)

In my application, I render a large number of convex shapes (fill & outline, with batching). I was sure that glDrawElements() would not have any advantage over RenderTarget:draw() with TrianglesStrip since my shapes were mostly disconnected.
I understand that when drawing text, SFML's rendering does not produce rectangles with shared vertices (due to this). Is it not the case with GUIs in general, given the amount of text they draw? Do you find many of your vertices shared among primitives? Should I abandon RenderTarget::draw() in favor of glDrawElements()?

One more question (and this is more related to RenderTarget state cache), aren't many of the primitives that SFGUI draws opaque (backgrounds, borders, etc.)?

binary1248 · « **Reply #7 on:** December 07, 2015, 02:09:36 am »

Quote from: hobby on December 06, 2015, 09:14:52 pm

I was sure that glDrawElements() would not have any advantage over RenderTarget:draw() with TrianglesStrip since my shapes were mostly disconnected.

This is not a reason not to use a single triangle strip. Degenerate triangles let you jump from primitive to primitive even if they are disconnected without producing any fragments.

Quote from: hobby on December 06, 2015, 09:14:52 pm

I understand that when drawing text, SFML's rendering does not produce rectangles with shared vertices (due to this). Is it not the case with GUIs in general, given the amount of text they draw? Do you find many of your vertices shared among primitives? Should I abandon RenderTarget::draw() in favor of glDrawElements()?

There is no point trying to share vertex positions if at least one attribute (in this case the texture coordinate) is going to change each time. Even if you use indexed rendering, you will still have a bit of duplication in one of your streams. SFML keeps it simple and doesn't even try to share.

Quote from: hobby on December 06, 2015, 09:14:52 pm

One more question (and this is more related to RenderTarget state cache), aren't many of the primitives that SFGUI draws opaque (backgrounds, borders, etc.)?

Text is always non-opaque, due to the anti-aliased glyph backgrounds having to be transparent, the same as in SFML (from which the glyph texture data comes from). The rest of the primitives are mostly opaque by default, because that just happens to be the default theme. If SFGUI had to draw non-opaque widgets, it would be able to as well, using the same rendering methods. You can see this by switching the theme in SFGUI's test application.

hobby · « **Reply #8 on:** December 07, 2015, 06:32:05 am »

If it wasn't clear, I was referring to SFGUI's using glDrawElements() with GL_TRIANGLE_STRIP as opposed to glDrawArrays() with GL_TRIANGLE_STRIP.

I use the latter (indirectly using RenderTarget::draw()), and of course, already make use of degenerate triangles (otherwise I couldn't batch anything). I was wondering if you gained any benefit from using glDrawElements() in SFGUI, as it might have some overhead (CPU/GPU) due to the need to maintain vertex indices. Why not use glDrawArrays() instead?

binary1248 · « **Reply #9 on:** December 09, 2015, 12:56:42 am »

There is a slight difference between glDrawElements() and glDrawRangeElements(). The latter gives the driver/GPU more opportunities to optimize things, which is why SFGUI uses it. There is no glDrawRangeArrays() (and it wouldn't make much sense anyway) so...

hobby · « **Reply #10 on:** December 09, 2015, 12:50:05 pm »

I see. I was looking only at VertexArrayRenderer (for 1.1 support) so I hadn't noticed SFGUI's usage of glDrawRangeElements.

Anyway, thanks to this discussion with you now I understand my options for optimization better. It seems that even the simpler glDrawElements could provide some benefit in my case if I were ready to do my own rendering. For example, I could cut the number of vertices for drawing text by 1/3. It seems that even SFGUI does not bother to do that, so surely it's less relevant to someone trying to use only SFML for drawing. I'm sticking to RenderTarget.draw(VertexArray)

.

eXpl0it3r · « **Reply #11 on:** December 10, 2015, 08:37:50 am »

I'll still have to test the code and see whether it makes a difference, but personally I think it makes sense to remove the optimization in favor of cleaner and better maintainable code, even if there are some scenarios where it might run worse after the removal.

As for future optimization, I think it would be beneficial to document the reasoning behind them somewhere, so when we get back to it years later, it still is known why it was added.

Laurent · « **Reply #12 on:** December 10, 2015, 09:17:10 am »

Quote

I'll still have to test the code and see whether it makes a difference, but personally I think it makes sense to remove the optimization in favor of cleaner and better maintainable code, even if there are some scenarios where it might run worse after the removal.

I don't think we can decide to do it (or not) before a significant part of the SFML users have tested and validated it. Especially users with low-end hardware.

Quote

As for future optimization, I think it would be beneficial to document the reasoning behind them somewhere, so when we get back to it years later, it still is known why it was added.

The reasoning is simple here, it is the same as usual: less OpenGL calls yields better performances.

binary1248 · « **Reply #13 on:** December 10, 2015, 12:36:02 pm »

Quote from: Laurent on December 10, 2015, 09:17:10 am

I don't think we can decide to do it (or not) before a significant part of the SFML users have tested and validated it. Especially users with low-end hardware.

This is always the case. Nothing different here.

Quote from: Laurent on December 10, 2015, 09:17:10 am

The reasoning is simple here, it is the same as usual: less OpenGL calls yields better performances.

This is like saying: Less standard library calls yields better performance.

I don't think this is the case if we end up implementing the algorithms ourselves instead.

The same can be said about OpenGL. We don't know whether we are doing exactly the same work as the driver would have done. As such, I don't think it is as simple as "minimize the number of calls". That is just a hint at where to look for bottlenecks... The only way to get true performance data is to do the measurements ourselves, and these measurements tell me that this optimization doesn't end up behaving like one.

Laurent · « **Reply #14 on:** December 10, 2015, 01:47:10 pm »

We could start arguing endlessly, but I think this topic will be much more useful if we focus on the important facts:
- the current code is the result of precise and intensive tests, not random code that I wrote in the hope that it might improve performances
- I'll be the first one to be happy about a total removal of the cache stuff, so go ahead, but don't draw conclusions to quickly, make as many users as possible to test the modifications (not just a few (or even none) as usual)

Author Topic: Get rid of the state cache inside RenderTarget (Read 13913 times)

binary1248

Get rid of the state cache inside RenderTarget

Laurent

Re: Get rid of the state cache inside RenderTarget

binary1248

Re: Get rid of the state cache inside RenderTarget

hobby

Re: Get rid of the state cache inside RenderTarget

binary1248

Re: Get rid of the state cache inside RenderTarget

Laurent

Re: Get rid of the state cache inside RenderTarget

hobby

Re: Get rid of the state cache inside RenderTarget

binary1248

Re: Get rid of the state cache inside RenderTarget

hobby

Re: Get rid of the state cache inside RenderTarget

binary1248

Re: Get rid of the state cache inside RenderTarget

hobby

Re: Get rid of the state cache inside RenderTarget

eXpl0it3r

Re: Get rid of the state cache inside RenderTarget

Laurent

Re: Get rid of the state cache inside RenderTarget

binary1248

Re: Get rid of the state cache inside RenderTarget

Laurent

Re: Get rid of the state cache inside RenderTarget