Do you suggest to bind the texture every draw call (i.e. comment out this line (https://github.com/SFML/SFML/blob/master/src/SFML/Graphics/RenderTarget.cpp#L255))?QuoteAre you talking about the vertex cache only, or everything (view, blend mode, texture, shader)?Everything.
I think it is safe to say that more advanced users know that there is a lot of potential if they batch themselves into VertexArrays, so this optimization doesn't even affect the right audience if you ask me.Due to various considerations (e.g. undocumented limit for # of vertices, application data structures that need to be contiguous in memory, etc.), even when batching is used there may be multiple draw() calls sharing the same state. And even if not completely, perhaps partially.
Finally getting rid of the cache would make current maintenance of RenderTarget and future extension much easier. I'm already having a hard time figuring out how to support custom shader attributes with this cache around, and it also seems to have an effect on #1015 (https://github.com/SFML/SFML/pull/1015) as well.As a user, I would prefer SFML to stay on the safe side with regard to best practices for performance, just like it does so well for C++. I'm not familiar with the new feature you're implementing or the overall maintenance efforts of the Graphics library, but I can say that I found RenderTarget quite elegant in its design when I implemented #1015.
Do you suggest to bind the texture every draw call (i.e. comment out this line (https://github.com/SFML/SFML/blob/master/src/SFML/Graphics/RenderTarget.cpp#L255))?I know that changing state often is a bad thing, but what I also know is that changing client state has much less of an impact than changing server state. Anything that has to do with client-side arrays is purely client side i.e. is performed in the driver. Whether we choose to do a bit of caching ourselves or leave it up to the driver process is just a matter of taste. In the end, it is executed by the same CPU anyway, which is why there is no difference. Moving this responsibility over to the driver will let it optimize much better if it sees any opportunities. Not only that, but moving it out of RenderTarget will also make room for newer RenderTarget implementations (using newer OpenGL), which will result in much better performance anyway. There is no point talking about optimizing that last 1% when we know that OpenGL 1.1 is far from the optimal way of doing things anyway.
The recommendation to minimize state changes can be found not only in official guides of OpenGL 1.1 (which SFML fortunately still supports) but also on more contemporary ones (Apple (https://developer.apple.com/library/ios/documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/OpenGLESApplicationDesign/OpenGLESApplicationDesign.html), ARM (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0555c/CHDCCCEG.html)).
Due to various considerations (e.g. undocumented limit for # of vertices, application data structures that need to be contiguous in memory, etc.), even when batching is used there may be multiple draw() calls sharing the same state. And even if not completely, perhaps partially.I'm not saying that it is always possible to draw your whole scene in a single draw call, but the people who count on the current optimization are literally doing nothing themselves to make the situation any better. Cutting down the number of batches will have a greater (and more noticeable) impact than having to enable and disable a few states more than necessary. You should normally start to optimize where it has a greater impact, and not where you don't have to think about it because it is taken care of by someone else.
As a user, I would prefer SFML to stay on the safe side with regard to best practices for performance, just like it does so well for C++. I'm not familiar with the new feature you're implementing or the overall maintenance efforts of the Graphics library, but I can say that I found RenderTarget quite elegant in its design when I implemented #1015.Like I said above, this "best practice" is written more with newer OpenGL in mind, and also has more noticeable effects when applied to newer OpenGL. I'm not a friend of going to great depths just to follow "best practice" even though it doesn't really benefit us and even hampers our efforts to maintain/extend the implementation. Getting rid of the state cache is purely a removal, and I don't know about you, but to me removing code which does not do what it was intended to do is an improvement. Encapsulation even had to be broken in Texture because the state cache had to track which texture was last active, already a minus point.
If you want to see how far batching can go, you can go look at the SFGUI renderers. If you look at the SFGUI test application, there are well over a thousand primitives that get rendered every frame, and that using 3 draw calls. People who say batching isn't possible at all in their application without even trying either have no idea what a batch is, or are just too lazy to optimize performance themselves.No doubt you've gone to great lengths to optimize rendering in SFGUI. I've noticed that you call OpenGL directly, while I draw exclusively using SFML.
I was sure that glDrawElements() would not have any advantage over RenderTarget:draw() with TrianglesStrip since my shapes were mostly disconnected.This is not a reason not to use a single triangle strip. Degenerate triangles let you jump from primitive to primitive even if they are disconnected without producing any fragments.
I understand that when drawing text, SFML's rendering does not produce rectangles with shared vertices (due to this (https://github.com/SFML/SFML/blob/master/src/SFML/Graphics/Text.cpp#L372,L373)). Is it not the case with GUIs in general, given the amount of text they draw? Do you find many of your vertices shared among primitives? Should I abandon RenderTarget::draw() in favor of glDrawElements()?There is no point trying to share vertex positions if at least one attribute (in this case the texture coordinate) is going to change each time. Even if you use indexed rendering, you will still have a bit of duplication in one of your streams. SFML keeps it simple and doesn't even try to share.
One more question (and this is more related to RenderTarget state cache), aren't many of the primitives that SFGUI draws opaque (backgrounds, borders, etc.)?Text is always non-opaque, due to the anti-aliased glyph backgrounds having to be transparent, the same as in SFML (from which the glyph texture data comes from). The rest of the primitives are mostly opaque by default, because that just happens to be the default theme. If SFGUI had to draw non-opaque widgets, it would be able to as well, using the same rendering methods. You can see this by switching the theme in SFGUI's test application.
I'll still have to test the code and see whether it makes a difference, but personally I think it makes sense to remove the optimization in favor of cleaner and better maintainable code, even if there are some scenarios where it might run worse after the removal.I don't think we can decide to do it (or not) before a significant part of the SFML users have tested and validated it. Especially users with low-end hardware.
As for future optimization, I think it would be beneficial to document the reasoning behind them somewhere, so when we get back to it years later, it still is known why it was added.The reasoning is simple here, it is the same as usual: less OpenGL calls yields better performances.
I don't think we can decide to do it (or not) before a significant part of the SFML users have tested and validated it. Especially users with low-end hardware.This is always the case. Nothing different here.
The reasoning is simple here, it is the same as usual: less OpenGL calls yields better performances.This is like saying: Less standard library calls yields better performance.
With caching: ~57 FPS
Without caching: ~59 FPS
I have *no* idea why it's so damn slow in both cases, but at least there's a slight effect.
02:00.0 VGA compatible controller: NVIDIA Corporation GK104 [GeForce GTX 660 Ti] (rev a1)
Driver enabled vsync maybe?Nope, it's not vsync. With 1024 sprites I get ~147.
I think in this case the 2 fps difference is more likely just noise and not something which comes from caching/non-caching.I don't think so, as I've watched the numbers for like a minute. It's reproducible as well.
Updated code:
...