Topic: Optimising CPU to GPU data transition

OcelotTheOcelot

  • Newbie
  • *
  • Posts: 2
Optimising CPU to GPU data transition
« on: September 03, 2023, 04:27:25 pm »
Hello.
I recently started to explore SFML and my little project is to create a game based on falling sand simulation (e.g. like Powder Toy or Noita).
My progress stumbled upon the issue of sending images from the CPU to textures on the GPU. My current setup is to change the image every frame according to the rules of the cellular automaton, i.e. edit an instance of sf::Image (e.g. 4096² pixels in size) and send the result to the corresponding instance of sf::Texture by using the Texture::update(const sf::Image&) function.
This texture is used by many sf::Sprite instances that are used to render different rects of that texture, no problem with that.
I would like my in-game particles (each represented by a pixel on the image) to be moving as smoothly as possible, updating the aforementioned texture every frame. My problem is that sending data from CPU to GPU is quite costly to perform every frame: my DIY FPS-meter shows ~205 FPS against ~3–4k FPS when the texture update isn't called but the automaton is still working.
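To put a rough number on that cost, here is a small back-of-the-envelope sketch. It assumes the image is 4096 x 4096 pixels at 4 bytes (RGBA) per pixel, and the helper names are made up for illustration. Under those assumptions one full update copies 64 MiB, and doing that 205 times per second comes to roughly 12.8 GiB per second.

```cpp
#include <cstdint>

// Assumed texture size: 4096 x 4096 pixels, 4 bytes (RGBA) per pixel.
constexpr std::uint64_t kSide = 4096;
constexpr std::uint64_t kBytesPerPixel = 4;

// Bytes copied to the GPU by one full-texture update.
constexpr std::uint64_t bytesPerUpload(std::uint64_t side, std::uint64_t bytesPerPixel) {
    return side * side * bytesPerPixel;
}

// Bytes copied per second when the upload happens once per frame.
constexpr std::uint64_t bytesPerSecond(std::uint64_t perUpload, std::uint64_t framesPerSecond) {
    return perUpload * framesPerSecond;
}

// 4096 * 4096 * 4 bytes is exactly 64 MiB.
static_assert(bytesPerUpload(kSide, kBytesPerPixel) == 64ull * 1024 * 1024,
              "one full upload is 64 MiB");
```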
Naturally, I would like the automaton's logic processing to be the bottleneck of my app, but right now it's the CPU/GPU data exchange, which I have no idea how to deal with using the provided SFML API. I am already using only one single texture instance, but I can't go lower than 1 update per frame, since it's the whole point of visualising a cellular automaton.
I tried to search for existing paint applications on SFML, because I assumed they would be updating canvas every frame too, but they seem to be using line primitives for drawing with mouse rather than changing the pixels in the image every frame. I also looked for existing SFML falling sand simulations but what I found is using rectangle primitives for each separate particle which obviously doesn't fit my needs.
« Last Edit: September 03, 2023, 04:32:03 pm by OcelotTheOcelot »

Hapax

  • Hero Member
  • *****
  • Posts: 3351
  • My number of posts is shown in hexadecimal.
Re: Optimising CPU to GPU data transition
« Reply #1 on: September 06, 2023, 05:14:47 pm »
If your app doesn't do much more, then maybe you can get away with 200+ FPS? ;D

That said, an sf::Image is a very expensive resource and transferring it to the GPU per frame seems wasteful if not required.
And it, of course, isn't! ;)

You mention the rectangle primitives and it could actually provide you with a solution. Namely, you update the rectangles each frame instead of the texture. However, moving six vertices per particle also seems wasteful if they never change size or shape.

Then, a solution could be to use a vertex array. As the name suggests, this is an array of (single) vertices. This would be one vertex per particle.
You would transfer all of the vertices at once to the GPU (when you draw them) so you can draw them to a render texture that would, then, act as your current texture. The difference here is that you are only transferring the vertices, not the image/texture, as the texture is already in graphics memory and you just draw to it with the vertices.
You may also find that this can be improved using a vertex buffer but if you're definitely transferring every frame, it may not help much.
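As a rough illustration of the one-vertex-per-particle idea, here is a minimal sketch in plain C++. The Vertex struct is only a stand-in that mirrors the position, colour and texture-coordinate fields of sf::Vertex, and makeGrid/recolour are hypothetical helper names. In SFML 2.x the same data would presumably live in an sf::VertexArray of sf::Points and be drawn with a single draw call. The half-pixel offset is an assumption so that each point lands on a pixel centre.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for sf::Color and sf::Vertex (assumed layout, for illustration only).
struct Rgba { std::uint8_t r, g, b, a; };
struct Vertex {
    float x, y;   // position
    Rgba color;   // particle colour
    float u, v;   // texture coordinates (unused here)
};

// Build one vertex per cell. This is done once, since positions never change.
std::vector<Vertex> makeGrid(std::size_t width, std::size_t height) {
    std::vector<Vertex> vertices(width * height); // value-initialised to zero
    for (std::size_t row = 0; row < height; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            Vertex& vertex = vertices[row * width + col];
            vertex.x = static_cast<float>(col) + 0.5f;
            vertex.y = static_cast<float>(row) + 0.5f;
        }
    }
    return vertices;
}

// Per frame: copy the automaton's cell colours into the vertices.
// Assumes cells.size() == vertices.size().
void recolour(std::vector<Vertex>& vertices, const std::vector<Rgba>& cells) {
    for (std::size_t i = 0; i < vertices.size(); ++i) {
        vertices[i].color = cells[i];
    }
}
```

Only the colours are touched each frame; the positions are written once in makeGrid.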

One question to ask is how much information are you transferring per pixel/particle? Are you simply just changing the colour (per pixel) to represent a type of particle (or even just particle or not)? If so, you don't even need the entire texture! You can encode the data into your texture and if you could store that information per pixel in just one byte, you could compress the information and reduce the image size by a factor of 4 (to a quarter of its size) by storing 4 pixels/particles in a row in just one image/texture pixel. This significantly reduces the amount of data being transferred.
You do, of course, need to be able to decode this data and probably the best way would be to use a shader. This puts more work on the GPU calculations but reduces CPU->GPU transfer data.
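The CPU side of that packing could look something like the sketch below. It assumes each particle type fits in one byte and that a texture pixel is four bytes (R, G, B, A); packIds and unpackId are hypothetical names, and unpackId only mirrors what a shader would have to do when decoding.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One texture pixel: four 8-bit channels (R, G, B, A).
using Pixel = std::array<std::uint8_t, 4>;

// Pack four one-byte particle ids into each pixel; unused slots stay 0.
std::vector<Pixel> packIds(const std::vector<std::uint8_t>& ids) {
    std::vector<Pixel> pixels((ids.size() + 3) / 4, Pixel{});
    for (std::size_t i = 0; i < ids.size(); ++i) {
        pixels[i / 4][i % 4] = ids[i];
    }
    return pixels;
}

// Decode: particle i lives in pixel i / 4, channel i % 4.
std::uint8_t unpackId(const std::vector<Pixel>& pixels, std::size_t i) {
    return pixels[i / 4][i % 4];
}
```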

Since we're on the subject of shaders, you could also send a chunk of data directly to the shader and have it process it for you. Again, if the particle can be represented by some limited values, you could send an array of numbers to represent those (I don't think SFML can do a (large) array of ints - just floats - so you may need to do some extra, but relatively simple, calculations).

If, however, a particle is a simple on/off, it should be easy enough to encode lots of those particles into a single Ivec4 (4 integers) and then send them to the shader using one or more setUniform calls (the Ivec4 overload).

Actually, although the shader is probably the best route to take (to replace the image/texture transfer altogether), if you can represent a particle/pixel by a single boolean, you could encode 32 pixels into a single pixel of your texture and this would allow you to reduce your image size by a factor of 32!
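For the on/off case, the bit packing might be sketched as follows. It assumes 32 cells per 32-bit word (one word per RGBA pixel, or four words per Ivec4, i.e. 128 cells); packBits and testBit are hypothetical names, and testBit shows the decode step a shader would need to replicate.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack one bit per cell, 32 cells per word. Cell i goes to bit (i % 32) of word (i / 32).
std::vector<std::uint32_t> packBits(const std::vector<bool>& cells) {
    std::vector<std::uint32_t> words((cells.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i < cells.size(); ++i) {
        if (cells[i]) {
            words[i / 32] |= std::uint32_t{1} << (i % 32);
        }
    }
    return words;
}

// Decode a single cell from the packed words.
bool testBit(const std::vector<std::uint32_t>& words, std::size_t i) {
    return ((words[i / 32] >> (i % 32)) & 1u) != 0u;
}
```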

If a solution you want isn't directly listed here, I hope what I've said can at least inspire you to find some other way!
Selba Ward -SFML drawables
Cheese Map -Drawable Layered Tile Map
Kairos -Timing Library
Grambol
 *Hapaxia Links*

OcelotTheOcelot

  • Newbie
  • *
  • Posts: 2
Re: Optimising CPU to GPU data transition
« Reply #2 on: September 10, 2023, 03:31:51 pm »
200+ FPS would indeed suffice, but my concern was that if the CPU-GPU data transition is the bottleneck, then it will be harder to notice a performance drop after implementing the automaton's processing logic, which should be the bottleneck instead, since it's the core system of my game.

The VertexArray solution does indeed seem interesting. I assume the performance concern that has been mentioned here https://en.sfml-dev.org/forums/index.php?topic=11550.0 is irrelevant since there will be only one draw call on the entire VertexArray. It does seem that this would consume a bit more RAM instead, adding the vertex positions (although unchanged) to the total amount of data. Maybe it's possible to optimise it by excluding empty cells from it but I'd like to assume the worst case scenario by default, where the screen is always full of particles (not necessarily moving).
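A quick estimate of that extra memory, as a sketch under stated assumptions: a vertex is taken to be 20 bytes (two position floats, four colour bytes, two texture-coordinate floats, which matches the fields of sf::Vertex), an image pixel is 4 bytes, and the worst case is one vertex for every cell of a 4096 x 4096 grid. The names are made up for illustration.

```cpp
#include <cstdint>

// Assumed per-cell sizes in bytes.
constexpr std::uint64_t kVertexBytes = 20; // 2 floats + 4 colour bytes + 2 floats
constexpr std::uint64_t kPixelBytes = 4;   // RGBA

// Total bytes for a buffer holding one entry per cell.
constexpr std::uint64_t bufferBytes(std::uint64_t cellCount, std::uint64_t bytesPerCell) {
    return cellCount * bytesPerCell;
}

constexpr std::uint64_t kCells = 4096ull * 4096ull;

// Worst case (every cell occupied): 320 MiB of vertices against 64 MiB of pixels.
static_assert(bufferBytes(kCells, kVertexBytes) == 320ull * 1024 * 1024, "vertex data");
static_assert(bufferBytes(kCells, kPixelBytes) == 64ull * 1024 * 1024, "pixel data");
```

So with a completely full grid the vertex data would be about five times the size of the image, which is worth measuring before committing to this route.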

I can't let the shader handle the automaton's logic and my particles are represented in full RGBA colour, meaning I won't be able to use a single bit per cell. Writing a whole codec just to send data between the two devices sounds like overkill and a hacky solution, but that's my intuition speaking.

Using the shader actually seems like the most valid solution that I'm definitely going to try first. Additionally, I assume that I'd be able to pass values higher than the 0-255 range allows, which might be useful later to add special effects (e.g. glowing for lava particles with R value higher than 255) using shaders.
I've seen the same recommendation about using the shaders when I was implementing this system with Unity ECS, but in the end it turned out that their Burst-compiled code can operate on par with GPUs which made this issue mostly irrelevant.

Thank you for such a detailed reply!
« Last Edit: September 10, 2023, 03:37:04 pm by OcelotTheOcelot »

Hapax

  • Hero Member
  • *****
  • Posts: 3351
  • My number of posts is shown in hexadecimal.
Re: Optimising CPU to GPU data transition
« Reply #3 on: September 10, 2023, 04:36:26 pm »
You are most welcome! I hope it helped!

Indeed, many draw calls can be a significant issue for speed and using a single one for all vertices is the best solution for that. There can be some performance drop using an extremely high number of vertices but that should be expected when doing so much work. It is, of course, nowhere near the issues caused if drawn separately.

Presuming you're using floats, there's not really a limit to the values so you can, of course, use higher than the 0-255 range. Except, however, for the colour channels, which are still in that range.
You could 'stretch' the colours in the shader so that 0-n becomes 0-255 and n-255 becomes >255 (such as glow, for example).
If you're not using textures (and I don't think you are!) then even better since you have 2 free floats per vertex! You could use those to control effects, for example.

To clarify a little, I wasn't suggesting that the shader should be handling logic, rather it can know the choices of particle (colour) and select them from a value given.
This would simply be something like sending a colour value that represented a palette index (for example, just the red channel if there can be only 256 colours) and then the shader can bring out the full 32-bit colour for each index. You'd send the palette to the shader using uniforms and then the vertices would have the indices. Again, if there are 256 different possibilities, you could represent 4 per single colour.
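The lookup itself is tiny. Here is a CPU-side sketch of what the shader would be doing, assuming a 256-entry palette of RGBA colours and one index byte per particle; Palette and decode are hypothetical names, and in the shader the palette would presumably arrive as a uniform array.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rgba { std::uint8_t r, g, b, a; };

// 256 possible particle colours, addressed by a one-byte index.
using Palette = std::array<Rgba, 256>;

// Expand one index byte per particle into its full 32-bit colour.
std::vector<Rgba> decode(const Palette& palette, const std::vector<std::uint8_t>& indices) {
    std::vector<Rgba> colours;
    colours.reserve(indices.size());
    for (std::uint8_t index : indices) {
        colours.push_back(palette[index]);
    }
    return colours;
}
```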

Since your vertices won't be moving at all, it might be better to use a vertex buffer although since you're updating the entire vertex anyway (just for the colour or whatever), it might not actually help that much. I haven't enough experience with vertex buffers to give a solid answer here.

Updating the vertex array should be faster than an image but, with that amount of data, it could begin to get similar issues. If any sections (not just individual particles but long groups of pixels) are unchanged, you can update just the parts you need but be aware that if you separate the parts you update, you are increasing the draw calls.
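Finding the part that needs updating can be as simple as comparing the previous and current frame and keeping the span between the first and last difference. The sketch below does that over a flat array of 32-bit cells; Range and changedRange are hypothetical names, and it assumes both frames have the same size. The resulting offset and count could then be handed to a partial update, for example the sf::VertexBuffer::update overload that takes a vertex count and an offset, or the sf::Texture::update overload that takes a pixel pointer plus a sub-rectangle when working in whole rows.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A contiguous span of cells: starts at `first`, covers `count` cells.
struct Range {
    std::size_t first;
    std::size_t count;
};

// Smallest single span covering every changed cell; count == 0 means nothing changed.
// Assumes previous.size() == current.size().
Range changedRange(const std::vector<std::uint32_t>& previous,
                   const std::vector<std::uint32_t>& current) {
    std::size_t first = 0;
    while (first < current.size() && previous[first] == current[first]) {
        ++first;
    }
    if (first == current.size()) {
        return Range{0, 0};
    }
    std::size_t last = current.size() - 1;
    while (previous[last] == current[last]) {
        --last; // stops at `first` at the latest, since that cell differs
    }
    return Range{first, last - first + 1};
}
```

Using one span keeps it to a single update call, at the cost of re-sending unchanged cells that sit between two changed ones.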

One other thing you may consider is using OpenGL directly. One advantage here is that you can set up the vertices to not even include parts you don't need and include parts you do. For example, the position could probably be skipped if represented in another way and maybe the texture too. This would just then be the colour parts of the vertex but this feels like it's just an image... ???
OpenGL is a lot more complicated though so if you are new to it, I'd certainly stick to trying to coax SFML into doing what you need!
SFML's more fun anyway ;D

Gavra Meads

  • Newbie
  • *
  • Posts: 2
Re: Optimising CPU to GPU data transition
« Reply #4 on: January 18, 2024, 03:38:49 pm »
When optimizing your particle system for real-time graphics, consider using sf::VertexArray to draw individual particles directly to the window instead of updating the entire texture every frame. You can create a vertex array, update its vertices based on the particle positions, and draw it once per frame. This eliminates the need to frequently update the texture on the GPU.
sf::VertexArray particles(sf::Points, particleCount);
// Update particle positions in the vertex array each frame
window.draw(particles);

If you still want to use a texture, consider updating it less frequently. You can batch multiple particle updates and apply them to the texture at once, reducing the number of times you call Texture::update. This can be particularly useful if multiple particles are affected by the same rule.
« Last Edit: January 19, 2024, 09:28:47 am by Gavra Meads »