Topic: SFGUI (0.4.0 released)

binary1248 · « **Reply #255 on:** January 28, 2012, 07:17:19 pm »

Quote from: "Laurent"

I was saying that using such high numbers was irrelevant for showing optimization results.

Well regarding optimization results, the more the better right? So if you have a library that runs at 2000 FPS instead of 1000 FPS it is sure to run faster on a slower system, say from 100 FPS to 200 FPS? Until you are sure that your GPU and CPU are both fully loaded you can always optimize no matter what the FPS value is. Optimization is not about maximizing the absolute FPS you get on one system, but rather eliminating the bottlenecks on all systems at the same time to ensure it will run faster (meaning more FPS regardless of the system) on all systems instead of at a given FPS value on one given system.

Therefore high numbers are indeed relevant, they show that a certain optimization does have a positive effect. And that effect will carry over to all systems, not only my own. Not everybody has a modern (in the last 2 years) GPU, and Tank's +2400 FPS might translate to e.g. +100 FPS from 50 FPS for them, which is very desirable.

You have to open up for a wider range of hardware if you really want to support OpenGL ES

Laurent · « **Reply #256 on:** January 28, 2012, 07:38:28 pm »

You don't get it. Let me explain again what I mean.

At 3600 FPS, one frame takes less than 300 microseconds to render. At such low durations, anything takes a significant part of the result, even event/window handling -- and basically, everything that happens only once per frame and that you don't want to see in the result. Even running a music player in the background could make a difference.

In one frame you draw many widgets, but you also do a lot of small things that are irrelevant to what you want to optimize. If you want to focus on the widgets themselves, you must draw many more of them so that the rest becomes really negligible. In my opinion, you must be below 100 FPS if you want to be credible.
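The arithmetic behind this point can be sketched in a few lines. The helper names are made up for illustration, and the 50 microsecond overhead used in the note below is an assumed figure, not a measurement from this thread.

```cpp
#include <cassert>

// Frame time in microseconds for a given frame rate.
double frame_time_us(double fps) {
    assert(fps > 0.0);
    return 1e6 / fps;
}

// Share of one measured frame that is taken by a fixed per-frame cost
// (event handling, window management, ...) unrelated to the code under test.
double overhead_share(double fps, double overhead_us) {
    return overhead_us / frame_time_us(fps);
}
```

Under that assumption, a fixed 50 microsecond cost is 18% of a frame at 3600 FPS, but only 0.5% of a frame at 100 FPS.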

Or... you should just say "overall performance improvement of ~200%" and not give too many details

binary1248 · « **Reply #257 on:** January 28, 2012, 08:48:22 pm »

Yeah, I know that anything running on my computer could skew the results. I actually get much lower FPS when running other CPU- or GPU-intensive applications while testing. But when I test the real performance of the library I make sure that my test environment is as clean as possible, i.e. CPU and GPU load < 5%, which is a decent margin of error.

Browsing the web or using "typical" everyday applications implies the same window management / event handling as anything else running on my computer and that doesn't even load my CPU past 5% which means the same OS overhead would apply to an SFML application too. Which leaves the other 95% of usage purely to the "essential" part of the app (drawing and what not).

We could just shove 1000 buttons into our ScrolledWindow and turn off culling to purposely get the FPS down to 100 FPS if that's what it takes to get reliable test results. But why do so if we can see the difference at much higher FPS values?

Being a hobby physicist I just have to throw in an example:

Consider you want to measure the speed of light. And you would do so by measuring the duration it takes for a laser pulse to travel a certain distance.

You could measure it by getting a 10 km long fiber optic cable, sending a pulse through it and measuring the time that takes. Or you could get a 10 m long fiber optic cable and measure the time it takes to travel through that with a high-precision measuring device. You would normally resort to the first method because it seems less susceptible to interference and to the margin of error of the time-measuring device. But because we made sure the test environment was clean and the same in both cases, we can also resort to the second method.

The key when testing is making sure you can reproduce your results, which in turn means the environment is fully understood and taken into account. The FPS values stated here are reproducible between system restarts and the relative FPS gain among the testers is also consistent which means that the FPS values themselves are in fact a reliable method of measuring performance, even at such high values.

Also worth reading: Amdahl's law
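For reference, Amdahl's law fits in a small function. This is only a sketch; the function name and the fractions in the note below are illustrative assumptions.

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction `p` of the frame time
// is made `s` times faster and the remaining (1 - p) stays unchanged.
double amdahl_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

If the optimized part is half of the frame time and becomes 4 times faster, the overall speedup is only 1.6x. The closer p gets to 1, the closer the overall gain gets to s.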

Tank · « **Reply #258 on:** January 28, 2012, 09:25:16 pm »

Quote

So you had time to test the new API before switching to OpenGL. Was it better (performance and usage) than the old one? I'm interested to see how it performs for GUI systems.

Yep, we also took the time to see how it performs. It was definitely better performance. Before the new graphics API the highest FPS we could get was ~400 (SFGUI test application), with our custom culling and using display lists we could get it up to ~1,600 FPS.

The new graphics API put out 1,200 FPS, without custom culling and without display lists. So it was still slower than the old API together with our optimizations, but faster without them.

Quote

By the way, since a GUI library needs to draw many small entities, it's very close to what I need to optimize in SFML. So if you feel like there are optimizations that I could apply to SFML, don't hesitate to share.

Basically: VBO. The whole GUI is stored in one single VBO, together with a texture atlas for one single texture (this may change in the future as there're limits regarding the maximum texture dimensions). Then there're a lot of matrix operations and other OGL calls that we can save because we're only calling what's indeed needed.

I think (binary1248 can give better explanations as he did the renderer) the biggest benefit is saving the bus (to GPU) the trouble by avoiding sending buffers (vertices, texture coordinates and colors) every frame.
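To make the "one buffer plus one atlas" idea concrete, here is a rough CPU-side sketch. It is not SFGUI's actual code: the `Vertex` layout, `AtlasRect` and `append_quad` are hypothetical names, and the upload of the resulting vector to a VBO is left out.

```cpp
#include <cassert>
#include <vector>

// One vertex as it might be laid out in the shared buffer (assumed layout).
struct Vertex {
    float x, y;  // position in pixels
    float u, v;  // normalised atlas coordinates in [0, 1]
};

// A sub-image inside the atlas, in atlas pixels.
struct AtlasRect {
    int left, top, width, height;
};

// Appends one textured quad (as two triangles) to the shared vertex list.
void append_quad(std::vector<Vertex>& buffer, float x, float y,
                 const AtlasRect& r, int atlas_w, int atlas_h) {
    assert(atlas_w > 0 && atlas_h > 0);
    const float u0 = static_cast<float>(r.left) / atlas_w;
    const float v0 = static_cast<float>(r.top) / atlas_h;
    const float u1 = static_cast<float>(r.left + r.width) / atlas_w;
    const float v1 = static_cast<float>(r.top + r.height) / atlas_h;
    const float x1 = x + r.width;
    const float y1 = y + r.height;

    const Vertex tl{x, y, u0, v0}, tr{x1, y, u1, v0};
    const Vertex bl{x, y1, u0, v1}, br{x1, y1, u1, v1};
    // Two triangles: (tl, tr, bl) and (tr, br, bl).
    buffer.insert(buffer.end(), {tl, tr, bl, tr, br, bl});
}
```

Every widget appends to the same vector and samples from the same atlas, so the whole GUI can live in one buffer and needs only one texture bind.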

The optimizations are actually shared; SFGUI is open source.

It's quite easy to see how the renderer works (check Renderer.cpp and Primitive.cpp at first, most important files regarding the rendering).

Elgan · « **Reply #259 on:** January 28, 2012, 09:31:18 pm »

This is very fun reading... there is software that will measure performance.

Maybe it would be fun to make a test benchmark thingie for SFML apps of sorts... not sure how it would work right now.

FPS, and external measuring of memory and... hm, whatever else.

binary1248 · « **Reply #260 on:** January 28, 2012, 10:12:12 pm »

Good idea... Laurent can decide what he deems worthy to benchmark in every version of SFML ^^. Then just write a spec and that will be the standard of testing. Such areas could be e.g. text rendering, sprite rendering, shape rendering etc.

Laurent · « **Reply #261 on:** January 28, 2012, 11:53:49 pm »

Quote

The key when testing is making sure you can reproduce your results, which in turn means the environment is fully understood and taken into account.

I'm pretty sure that you don't fully understand how SFML can impact your performance. For example, there's a bug in event handling on Windows that will slow down some applications randomly. However it happens once per frame so if your test application is really loaded it will hardly make a difference on the final result.

I'm not saying that your tests are flawed, I'm even sure that they are strictly executed and interpreted. But that's not the most efficient way of testing things, and some other people might not trust your results.

Quote

It was definitely better performance. Before the new graphics API the highest FPS we could get was ~400 (SFGUI test application), with our custom culling and using display lists we could get it up to ~1,600 FPS.

The new graphics API put out 1,200 FPS, without custom culling and without display lists. So it was still slower than the old API together with our optimizations, but faster without them.

Thanks for the feedback.

Quote

Basically: VBO.

I was afraid you would say that

A GUI is a static thing, so I guess that VBO are perfect. In SFML things are more complicated, I cannot assume any particular usage. For example, many people use a single dynamic sprite to draw everything in their game. I must design and implement things as if every property of every entity could change every frame.
At least, with the new API, people who know a little about graphics programming can write efficient code with SFML. They're no longer stuck with slow sprites.

Quote

The optimizations are actually shared; SFGUI is open source.

True

binary1248 · « **Reply #262 on:** January 29, 2012, 12:41:00 am »

Quote

For example, there's a bug in event handling on Windows that will slow down some applications randomly.

Interesting... another bug I didn't know about.

Quote

I must design and implement things as if every property of every entity could change every frame.

That's exactly what the GL_STREAM_DRAW usage hint was designed for. It tells the GPU that it can expect buffer data to change between every draw call (even multiple times). The difference is that with a VBO which you update completely every frame the data is in a single buffer on the card. Think of it like calling new int 1000 times versus calling new int[1000] once. The second variant would probably complete faster for the exact same reasons. And if you "prepare" the data to be as GPU friendly as possible, it will reward you appropriately.

"Prepare" would mean things like:

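- Converting your geometry data to draw using one primitive type
- Reducing state changes (less texture binds, etc.) by batching and reordering the draws on CPU without influencing the final outcome of the frame.

What one needs to look for are values which the GPU probably has to calculate every frame but stay exactly the same. They can be calculated when needed on the CPU and passed to the GPU "prepared" so it can save a lot of effort putting those pixels on the screen.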

If you want an extreme (and still quite buggy) example of how I prepare data for the GPU, have a look at the texture preblending we do in the new Renderer. It offloads the blending from the GPU to the CPU under the assumption that the blended pixel values stay the same over all frames. I tried it out because 1. GPUPerfStudio was telling me that the GPU was stalling on buffer operations and 2. because I was crazy enough and had too much time. It seemed to harvest more performance and can work under the right circumstances.
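As a rough illustration of what blending on the CPU can look like, here is a standard source-over blend of one 8-bit RGBA pixel onto an opaque background. This is an assumed, simplified stand-in, not the preblending code from the SFGUI renderer; `Rgba`, `blend_channel` and `preblend` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct Rgba {
    std::uint8_t r, g, b, a;
};

// Blends one 8-bit channel: src * alpha + dst * (1 - alpha), rounded.
std::uint8_t blend_channel(std::uint8_t src, std::uint8_t dst, std::uint8_t alpha) {
    const int value = src * alpha + dst * (255 - alpha);
    return static_cast<std::uint8_t>((value + 127) / 255);
}

// Source-over blend of `src` onto an opaque `dst`, done once on the CPU.
Rgba preblend(Rgba src, Rgba dst) {
    assert(dst.a == 255);
    return Rgba{blend_channel(src.r, dst.r, src.a),
                blend_channel(src.g, dst.g, src.a),
                blend_channel(src.b, dst.b, src.a),
                255};
}
```

The result can be stored in the texture and drawn without blending enabled, which only pays off as long as both inputs stay the same from frame to frame.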

The key is really to make a library so intelligent it knows how it can optimize the users data by itself in every situation.

For those who are curious, during the implementation of the new renderer I used gDebugger, GPUPerfStudio, valgrind and of course faithful gprof.

Laurent · « **Reply #263 on:** January 29, 2012, 09:38:02 am »

Quote

That's exactly what the GL_STREAM_DRAW usage hint was designed for

My tests showed that locking/updating/unlocking a GL_STREAM_DRAW VBO is still slower than a vertex array, which is already slower than immediate mode in such a context.

Quote

The difference is that with a VBO which you update completely every frame the data is in a single buffer on the card. Think of it like calling new int 1000 times versus calling new int[1000] once.

Does it mean that you create one single big VBO and then "allocate" your widgets' geometries inside it with a custom algorithm? I never succeeded in writing such an implementation, because writing an efficient allocator is really complex.
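For readers wondering what such sub-allocation could look like, below is a deliberately simple first-fit sketch over vertex ranges. It is not taken from SFML or SFGUI; `RangeAllocator` is a hypothetical class, and a production allocator would also have to deal with fragmentation and growing the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// First-fit allocator handing out vertex ranges inside one large buffer.
class RangeAllocator {
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    explicit RangeAllocator(std::size_t capacity) {
        if (capacity > 0) free_[0] = capacity;
    }

    // Returns the offset of a free range of `count` vertices, or npos.
    std::size_t allocate(std::size_t count) {
        assert(count > 0);
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second < count) continue;
            const std::size_t offset = it->first;
            const std::size_t remaining = it->second - count;
            free_.erase(it);
            if (remaining > 0) free_[offset + count] = remaining;
            return offset;
        }
        return npos;
    }

    // Returns a range to the free list and merges it with its neighbours.
    void release(std::size_t offset, std::size_t count) {
        auto it = free_.emplace(offset, count).first;
        auto next = std::next(it);
        if (next != free_.end() && it->first + it->second == next->first) {
            it->second += next->second;
            free_.erase(next);
        }
        if (it != free_.begin()) {
            auto prev = std::prev(it);
            if (prev->first + prev->second == it->first) {
                prev->second += it->second;
                free_.erase(it);
            }
        }
    }

private:
    std::map<std::size_t, std::size_t> free_;  // offset -> length
};
```

Each widget would keep the offset it was given and rewrite only its own range when its geometry changes.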

Quote

Converting your geometry data to draw using one primitive type

SFML is not high-level enough, it explicitly allows one to choose its primitive type.

Quote

Reducing state changes (less texture binds, etc.) by batching and reordering the draws on CPU without influencing the final outcome of the frame.

Again, SFML is not high-level enough, it must do immediate drawing so no batching is possible. I optimize state changes and pre-transform small entities on the GPU but I feel like this is the maximum that I can do.

With a GUI, the order is defined by the parent-child relationship, you can easily have a scenegraph behind the scenes and benefit from all the nice optimizations that such a data structure allows. This is hardly applicable to SFML.

Quote

The key is really to make a library so intelligent it knows how it can optimize the users data by itself in every situation.

Is it your feeling about SFML too, knowing that it provides low-level primitives and doesn't know what the user will do with them?

Laurent · « **Reply #264 on:** January 29, 2012, 10:56:38 am »

I've had a look at Renderer.cpp, and now I understand your rendering strategy (you can ignore my related question above). It is definitely not applicable to SFML because I can't batch everything and delay all the rendering until the end of the frame.

I've seen some really nice ugly hacks and the even nicer comments associated to them about SFML. If you want to talk about these issues I'm here

binary1248 · « **Reply #265 on:** January 29, 2012, 01:09:34 pm »

Like I said, a library has to recognize opportunities to optimize and do it the best it can. Of course you can't optimize in exactly the same way for every single use case there is.

For example some people might not make use of VertexArrays or custom primitive types for whatever reason. Then you can assume that whatever they draw every frame (sprites, text, etc.) can be broken down into triangles.

You could also for example batch text draws together. Say the user draws multiple sf::Texts after each other (very common from what I've seen) with the same sf::Font (face and size same), you can also batch those together. It saves you from stopping to check what's next to draw only to find out that it's exactly the same kind of data that you previously drew and even using the same texture.
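A minimal sketch of that kind of batching, assuming the draws have already been recorded as vertex ranges in a shared buffer. `DrawCommand` and `merge_adjacent` are hypothetical names, and the texture id stands in for whatever state two draws need to share (for text, the font's glyph texture).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct DrawCommand {
    unsigned texture;   // texture id the draw needs bound
    std::size_t first;  // first vertex in the shared buffer
    std::size_t count;  // number of vertices
};

// Merges neighbouring commands that use the same texture and cover
// adjacent vertex ranges. The relative order of draws is preserved.
std::vector<DrawCommand> merge_adjacent(const std::vector<DrawCommand>& in) {
    std::vector<DrawCommand> out;
    for (const DrawCommand& cmd : in) {
        if (!out.empty() && out.back().texture == cmd.texture &&
            out.back().first + out.back().count == cmd.first) {
            out.back().count += cmd.count;  // extend the previous draw
        } else {
            out.push_back(cmd);
        }
    }
    return out;
}
```

Because only neighbouring commands are merged, the final image is the same as if every command had been drawn on its own.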

Quote

My tests showed that locking/updating/unlocking a GL_STREAM_DRAW VBO is still slower than a vertex array, which is already slower than immediate mode in such a context.
...
It is definitely not applicable to SFML because I can't batch everything and delay all the rendering until the end of the frame.

Well correct me if I'm wrong, but the user won't see anything on the screen until he calls Display on his window anyway. So whether the drawing takes place right where he calls it or is saved and performed in the same order right before the buffer is swapped, I don't see the difference. The big one though is that you would transfer your data in bigger chunks which is where VBOs start to shine. As long as data runs around host memory or GPU memory it stays fast. When it has to run across the PCIe bus, and that too many times per frame, it becomes the bottleneck which is what you are probably seeing in your comparison between Vertex Arrays and VBOs.

VBOs also don't perform too well if they are too small. So to use them properly you would have to serialize a lot of that primitive data together and draw multiple times using that 1 buffer. I also don't have to stress that VBOs don't have to be drawn completely in 1 pass. They can hold data for different primitive types. Heck they can hold completely different data sets together one after the other. Whatever you can draw using multiple Vertex Arrays you can draw using 1 VBO and multiple draw calls.
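The last point can be sketched on the CPU side like this. `SharedVertexStore` and `DrawRange` are hypothetical names; the recorded values mirror the (mode, first, count) arguments that glDrawArrays takes, and the actual upload and draw calls are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec2 {
    float x, y;
};

enum class Primitive { Triangles, Lines };

// What a later draw call needs: the primitive type plus a vertex range.
struct DrawRange {
    Primitive type;
    std::size_t first;
    std::size_t count;
};

// Collects unrelated geometry into one contiguous vertex list.
class SharedVertexStore {
public:
    DrawRange append(Primitive type, const std::vector<Vec2>& vertices) {
        const DrawRange range{type, data_.size(), vertices.size()};
        data_.insert(data_.end(), vertices.begin(), vertices.end());
        ranges_.push_back(range);
        return range;
    }

    const std::vector<Vec2>& data() const { return data_; }
    const std::vector<DrawRange>& ranges() const { return ranges_; }

private:
    std::vector<Vec2> data_;
    std::vector<DrawRange> ranges_;
};
```

The whole `data()` vector would be uploaded once, and each recorded range then becomes one draw call against that single buffer.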

Laurent · « **Reply #266 on:** January 29, 2012, 07:02:28 pm »

Quote

You could also for example batch text draws together. Say the user draws multiple sf::Texts after each other (very common from what I've seen) with the same sf::Font (face and size same), you can also batch those together. It saves you from stopping to check what's next to draw only to find out that it's exactly the same kind of data that you previously drew and even using the same texture.

I already have a state cache, I only set the states that changed between two Draw calls.
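For readers unfamiliar with the term, a state cache can be as small as the sketch below. This is not SFML's implementation; `FakeContext` is a stand-in that merely counts the binds a real context would receive.

```cpp
#include <cassert>

// Stand-in for the GL context: counts how often a texture bind would
// actually be issued (a real version would call glBindTexture here).
struct FakeContext {
    int binds = 0;
    void bind_texture(unsigned /*id*/) { ++binds; }
};

class TextureStateCache {
public:
    explicit TextureStateCache(FakeContext& ctx) : ctx_(ctx) {}

    // Issues a bind only when the requested texture differs from the
    // one that is already current.
    void use_texture(unsigned id) {
        if (has_current_ && current_ == id) return;
        ctx_.bind_texture(id);
        current_ = id;
        has_current_ = true;
    }

private:
    FakeContext& ctx_;
    unsigned current_ = 0;
    bool has_current_ = false;
};
```

Requesting textures 7, 7, 7, 3, 3, 7 in that order results in three binds instead of six. The draw calls themselves are still issued one by one.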

Quote

Well correct me if I'm wrong, but the user won't see anything on the screen until he calls Display on his window anyway. So whether the drawing takes place right where he calls it or is saved and performed in the same order right before the buffer is swapped, I don't see the difference.

Good point. That reminded me of something, so I checked and found that two years ago I already tried to implement batching in SFML 2.
Here is what I said on 19/01/2010:

Quote

The automatic batching system was great, but after using it for a while and collecting feedback, I realized that it was creating new problems that were very tricky to solve.

Unfortunately I don't remember what these problems were.

binary1248 · « **Reply #267 on:** January 29, 2012, 09:00:44 pm »

Quote from: "Laurent"

I already have a state cache, I only set the states that changed between two Draw calls.

State changes aren't the only things you can save on although they make up a big piece of the time it takes to draw a frame. Draw calls are almost as expensive overhead-wise as state changes. Drawing 10 Sprites/Rectangle shapes for example requires vertex data for just 40 vertices but causes more than 40 OpenGL calls to be made.
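One common way to cut the call count is to keep all quads in a single vertex buffer and draw them as indexed triangles. The sketch below only builds the index list; it assumes each quad's four corners are stored consecutively in perimeter order, and `quad_indices` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds triangle indices for `quad_count` quads whose four corners are
// stored consecutively (0-1-2-3, 4-5-6-7, ...) in one vertex buffer.
std::vector<std::uint16_t> quad_indices(std::size_t quad_count) {
    assert(quad_count * 4 <= 65536);  // indices must fit in 16 bits
    const int corners[6] = {0, 1, 2, 0, 2, 3};  // two triangles per quad
    std::vector<std::uint16_t> indices;
    indices.reserve(quad_count * 6);
    for (std::size_t q = 0; q < quad_count; ++q) {
        for (int corner : corners) {
            indices.push_back(static_cast<std::uint16_t>(q * 4 + corner));
        }
    }
    return indices;
}
```

For 10 quads this yields 60 indices over the same 40 vertices, which could then be submitted with a single indexed draw, provided all quads share the same texture and states.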

Quote from: "Laurent"

Good point. That reminded me of something, so I checked and found that two years ago I already tried to implement batching in SFML 2.
Here is what I said on 19/01/2010:
Quote
The automatic batching system was great, but after using it for a while and collecting feedback, I realized that it was creating new problems that were very tricky to solve.

Unfortunately I don't remember what these problems were.

Well... you did change the drawing routines completely and don't use glBegin() glEnd() anymore. So maybe those problems won't carry over to the new drawable API. A link to that thread would be nice.

Laurent · « **Reply #268 on:** January 29, 2012, 10:25:31 pm »

Quote

Well... you did change the drawing routines completely and don't use glBegin() glEnd() anymore. So maybe those problems won't carry over to the new drawable API.

This code had implementations for VBO, VA and IM.

Thread:
http://www.sfml-dev.org/forum/viewtopic.php?t=2063
(not very helpful because that's where I say that I removed the batching stuff)

More useful, the last revision where it was used:
https://github.com/SFML/SFML/tree/8ba9495c02f95dbff8aee44121a13f999234fb2f

binary1248 · « **Reply #269 on:** January 30, 2012, 01:43:12 am »

From what I can tell reading through the source, it would have been bottlenecked by the CPU instead of the GPU. Thus whether you used VAs, VBOs or IMs it probably wouldn't make any significant difference. Your usage of the word "Batch" to describe the class containing the data for a single drawable is also kind of misleading. They weren't really batched data and so could not profit from batching at all.

Your idea of uploading data into a single buffer and drawing all at once at the end was good. HOWEVER, if you only draw the data one object at a time they will be, as you saw, hardly any better than VAs or even IM.

It would have probably made a big difference if you had stored more relevant data inside the Renderer object and let it manage drawing the objects itself when the time came. That way it would have been able to truly batch multiple objects together if it saw the possibility, saving not only a little GPU time but a massive amount of CPU time. Contrary to what people think, most state changes and matrix ops take place on the CPU in the driver and the data gets sent in its raw form to the GPU. Thus if the CPU is already busy going through all the batches every frame, the FPS will be hurt even more by redundant state changes, which were abundant in that version.

Because you changed SFML a lot since then and cache states more effectively now and even use VAs as the primary drawing method, it would be nice to see how that old concept would fare in the current implementation.

And I'm curious, were these problems you speak of bugs/glitches or the flexibility/limitation kind of problems? I couldn't find any reports of problems related to the old drawing method while searching through those old threads.

Quote

I've seen some really nice ugly hacks and the even nicer comments associated to them about SFML. If you want to talk about these issues I'm here

Wishlist (among other things to make SFGUI less "hacky"):