Topic: Vec3, Vec2 Speed Test

Meltra Bour · « **on:** September 08, 2009, 12:31:46 pm »

This weekend I spent some time looking for performance boosts; one of the things I tested was our vec2/3/4 classes. First off, we have our own; we are not using those provided by SFML. We only use SFML to open a window and manage input (atm). But since the classes look alike, I was left wondering why Vec3 and Vec2 are not declared inline and, further down, why some huge Unicode functions are inline?

Below are some of the different ways I tested …

Initial Code: Overload outside of the class

    template <typename T>
    class Vec2
    {
        public:
            ...
    };

    // Declaration: non-member operator, outside the class
    template <typename T>
    Vec2<T> operator* (const Vec2<T>& v1, const T& v2);

    ...
    // Definition: signature matches the declaration above
    template <typename T>
    Vec2<T> operator* (const Vec2<T>& v1, const T& v2)
    {
        return Vec2<T>(v1.v[0]*v2, v1.v[1]*v2);
    }

Improvement 1: Overload inside the class

    template <typename T>
    class Vec2
    {
        public:
            ...
            Vec2<T> operator* (const T& v2);
            ...
    };

    ...
    template <typename T>
    Vec2<T> Vec2<T>::operator* (const T& v2)
    {
        return Vec2<T>(v[0]*v2, v[1]*v2);
    }

Improvement 2: inline declarations

    template <typename T>
    class Vec2
    {
        public:
            ...
            Vec2<T> operator* (const T& v2);
            ...
    };

    ...
    template <typename T>
    inline Vec2<T> Vec2<T>::operator* (const T& v2)
    {
        return Vec2<T>(v[0]*v2, v[1]*v2);
    }

- The actual test involved drawing 125k cubes and animating them; this led to roughly 8 million vec2/3 operations per frame. The test was repeated over 100 frames × 2 OSes (Linux, Windows) × 3 different PCs …
- I tried a lot more than just the 2 above, but those stood out and they did not break anything (in our engine).
- Only tested with gcc, on both Linux and Windows …
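A rough sketch of how a test like the one described above could be timed. The names `Vec2f`, `scale_all`, and `time_scale_ms` are made up for illustration and are not taken from the engine or from SFML. It also assumes a C++11 compiler for `std::chrono`, which is newer than the compilers discussed in this thread.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the engine's vector class.
struct Vec2f {
    float v[2];
    Vec2f(float x, float y) { v[0] = x; v[1] = y; }
};

// Non-member operator, marked inline.
inline Vec2f operator*(const Vec2f& a, float s) {
    return Vec2f(a.v[0] * s, a.v[1] * s);
}

// Scales every element once and returns a checksum, so the work has an
// observable result and is less likely to be optimized away.
float scale_all(std::vector<Vec2f>& data, float s) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * s;
        sum += data[i].v[0] + data[i].v[1];
    }
    return sum;
}

// Runs `frames` passes over `count` vectors and returns elapsed milliseconds.
double time_scale_ms(std::size_t count, int frames, float& checksum) {
    std::vector<Vec2f> data(count, Vec2f(1.0f, 2.0f));
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    for (int f = 0; f < frames; ++f) {
        checksum = scale_all(data, 1.0f);
    }
    std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Swapping the operator for a member or a non-inline version and calling `time_scale_ms` with the same `count` and `frames` would give comparable numbers for each variant; the absolute values depend heavily on compiler and flags.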

So here are the results I got:

- Initial code = 100%
- Improvement 1 = 126.23% (26% faster)
- Improvement 2 = 277.60% (177% faster)

The classes in SFML kinda look like our initial code, so maybe it would be a boost for SFML as well? Is there a specific reason SFML is overloading outside of the class? Something I might have missed?

Laurent · « **Reply #1 on:** September 08, 2009, 12:37:26 pm »

Well, inside or outside doesn't change anything (except that inside the class is considered inline by default), the inline keyword should be enough.

Could you show the compiler options that you used for your tests?
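To illustrate the point about implicit inlining, here is a minimal sketch; the class below is a hypothetical stand-in and not SFML's actual `Vector2`. A member function defined inside the class body is implicitly `inline`, while a function defined outside needs the keyword written out. In both cases the compiler still decides on its own whether to actually inline a given call.

```cpp
// Hypothetical minimal class, only meant to show where `inline` applies.
template <typename T>
class Vec2 {
public:
    Vec2(T x, T y) { v[0] = x; v[1] = y; }

    // Defined inside the class body: implicitly inline.
    Vec2<T> operator+(const Vec2<T>& rhs) const {
        return Vec2<T>(v[0] + rhs.v[0], v[1] + rhs.v[1]);
    }

    T v[2];
};

// Defined outside the class: the `inline` keyword is written explicitly.
template <typename T>
inline Vec2<T> operator*(const Vec2<T>& lhs, const T& s) {
    return Vec2<T>(lhs.v[0] * s, lhs.v[1] * s);
}
```

With `Vec2<float> a(1.0f, 2.0f);`, both `a + a` and `a * 3.0f` compile against this sketch; the scalar has to be a `float` here so that `T` is deduced consistently.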

Meltra Bour · « **Reply #2 on:** September 08, 2009, 01:14:41 pm »

    g++.exe -m32 -c -O3 -s -I./inc -I./lib/inc -MMD -MP -MF <files> -o <files>
    g++.exe -o ./bin/test.exe -s <files> -L <libs>

Laurent · « **Reply #3 on:** September 08, 2009, 01:49:43 pm »

Ok, nothing wrong with them

Thanks for your feedback, I'll modify the code as soon as possible.

Note that the speed improvement may not be noticeable in SFML itself, but it's still better to have these kinds of operators inline.

Meltra Bour · « **Reply #4 on:** September 08, 2009, 02:03:07 pm »

I had to look it up, but if my info is right, inside vs. outside the class will create 2 vs. 3 temporary variables in the CPU's cache/registers:
1 vs. 2 for the variables passed to the function, plus 1 for the result.

And yeah, we use those operators a lot, so the difference in SFML might be smaller, but still, every CPU cycle counts, so …

np

Nexus · « **Reply #5 on:** September 08, 2009, 03:25:07 pm »

To be honest, I wouldn't turn the operators into member functions just because one user measured a performance difference. But inlining the global function templates (with the keyword "inline") should be ok.

However, I wonder why your compiler isn't able to optimize that...

Laurent · « **Reply #6 on:** September 08, 2009, 04:14:54 pm »

I don't understand why the generated code would be different for the member operator definition compared to the non-member version.

The non-member version is much cleaner, so I'll keep it like this
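A small sketch of why the two forms would be expected to behave the same; `Vec2d`, `scale`, and `same_result` are hypothetical names used only for this comparison. The member version receives its left operand through the implicit `this` pointer, and the non-member version receives it as an explicit reference, so both work on the same two inputs. This says nothing about what a particular GCC release actually generates.

```cpp
// Hypothetical 2D vector with both operator forms, for comparison only.
struct Vec2d {
    double v[2];
    Vec2d(double x, double y) { v[0] = x; v[1] = y; }

    // Member form: the left operand arrives through the implicit `this` pointer.
    Vec2d operator*(double s) const {
        return Vec2d(v[0] * s, v[1] * s);
    }
};

// Non-member form: the left operand is an explicit reference parameter.
// Named `scale` here so it does not clash with the member operator.
inline Vec2d scale(const Vec2d& lhs, double s) {
    return Vec2d(lhs.v[0] * s, lhs.v[1] * s);
}

// Returns true when both forms produce the same components.
inline bool same_result(const Vec2d& a, double s) {
    Vec2d m = a * s;        // calls a.operator*(s)
    Vec2d n = scale(a, s);  // calls the free function
    return m.v[0] == n.v[0] && m.v[1] == n.v[1];
}
```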

Meltra Bour · « **Reply #7 on:** September 08, 2009, 07:08:19 pm »

Quote from: "Nexus"

However, I wonder why your compiler isn't able to optimize that...

That was a good one. I have no clue when it comes to that stuff, so I just use the tools provided; I never took a closer look at them before.
But for some reason we are still using gcc 3.5. I updated this to 4.4 on my Windows machine and ...

- All 3 tests are exactly the same, no difference in performance at all.
- The .exe doubled in size, but it was exactly the same for all tests mentioned.
- Performance is a lot better than the fastest test I got with gcc 3.5; at first glance it's more than twice as fast.

All it took was updating from gcc 3.5 to 4.4, thx for the hint.

Meltra Bour · « **Reply #8 on:** September 15, 2009, 11:54:04 am »

I'm back to my first finding ... using the keyword inline speeds up the app with gcc 3.5 and 4.4.

As for the member operator definition compared to the non-member version: I only see a difference for that in gcc 3.5; gcc 4.4 doesn't seem to care about it when the inline keyword is used. If you're not using 'inline', then member functions will result in a small performance boost.

We were using gcc 3.5 because gcc 4.4 takes optimization a bit further than you want it to. Profiling bits of code in gcc 4.4 is tricky business; the tests I did were using default shapes, and those shapes are hard-coded into the app, so ...

gcc 4.4 changed the code around from something like

    #include <iostream>

    float x = 1.5f;
    float y = 2.5f;

    int main() {
      float z = x + y;
      std::cout << z;
    }

to

    #include <iostream>

    int main() {
      std::cout << 4.0f;
    }

Nice one, but not what you want when you're trying to figure out the number of operations your app can pull off. I should type up some code so you can test it yourself but ... too lazy atm.
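One common trick for keeping a benchmark's inputs from being folded into a constant, sketched under the assumption that the inputs are simple globals as in the example above: marking them `volatile` forces the compiler to load them at run time. The names `input_x`, `input_y`, and `add_inputs` are hypothetical, and this is only one option; reading the values from a file or the command line would have a similar effect.

```cpp
// `volatile` forces real loads at run time, so the sum cannot be
// precomputed at compile time.
volatile float input_x = 1.5f;
volatile float input_y = 2.5f;

float add_inputs() {
    float a = input_x;  // volatile read
    float b = input_y;  // volatile read
    return a + b;       // the addition happens at run time
}
```

The result is still `4.0f`, but the addition is now performed when the program runs rather than by the optimizer.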

Tank · « **Reply #9 on:** September 15, 2009, 12:56:27 pm »

What about -O0?

Meltra Bour · « **Reply #10 on:** September 15, 2009, 01:54:15 pm »

Not sure what you mean, Tank; -O0 would not optimize the code at all. We only use -O0 (default) to debug or test new code. My target was to figure out the max number of verts we could animate and look for ways to up that number. So I don't see the point in testing it with -O0?

edit: hmm maybe there are some benefits in testing it with -O0, make sure it's as fast as possible that way and then up it to -O3 to get final numbers ...

Tank · « **Reply #11 on:** September 16, 2009, 11:23:33 am »

Quote

edit: hmm maybe there are some benefits in testing it with -O0, make sure it's as fast as possible that way and then up it to -O3 to get final numbers ...

That's exactly what I meant with my suggestion. You were complaining about the compiler optimizing your code so that you aren't able to test well enough. -O0 disables optimizations and lets you profile your code.

resistor · « **Reply #12 on:** September 16, 2009, 11:28:19 am »

Quote from: "Tank"

That's exactly what I meant with my suggestion. You were complaining about the compiler optimizing your code so that you aren't able to test well enough. -O0 disables optimizations and lets you profile your code.

Profiling unoptimized code isn't always very useful, since the things that take a long time in unoptimized code can be totally different than the things that take a long time in optimized code.

Tank · « **Reply #13 on:** September 17, 2009, 04:55:05 pm »

True, but considering that most compilers optimize completely differently, profiling unoptimized code *can* give a hint where bottlenecks are. If the unoptimized code you write is fast, then it's probably fast when optimized with all compilers.

But yeah, this is mostly unnoticeable.