I don't necessarily see why the two functions I referenced would particularly interrupt the OpenGL pipeline, but I don't have a tremendous amount of experience with the inner workings of OpenGL, so I'm more than willing to concede that point.
Modern GPUs are becoming eerily similar to the "rest of the machine" they reside in. They have their own memory, fixed-function processors and programmable shaders. Just like any CPU, the programmable shaders execute instructions based on a program that is loaded into their instruction memory. And just like any CPU, they perform load/store operations, meaning reads/writes to/from memory (i.e. non-register storage). This takes time, a lot of time, which is why shader units, just like CPUs, have their own caches as well. They try to retain "hot data" so that it doesn't have to be fetched/written too often, thus minimizing stalls. I don't know how intelligent the cache and prediction subsystems in GPUs have become, but many people have found that a large number of non-linear memory accesses takes a measurable toll on performance compared to linear memory accesses. This means that "jumping back and forth" within a potentially large data set such as yours can lead to a high number of cache misses and a potential performance drop.
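To illustrate the general locality effect, here is a minimal CPU-side sketch (it is not a GPU measurement, just the same "linear walk vs. jumping back and forth" idea over one large buffer). The buffer size, the fixed seed and the timing method are arbitrary choices for the example:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    const std::size_t count = 1 << 24;        // ~16M "vertices"
    std::vector<float> data(count, 1.0f);

    std::vector<std::size_t> indices(count);
    std::iota(indices.begin(), indices.end(), 0);

    auto traverse = [&](const std::vector<std::size_t>& order)
    {
        const auto start = std::chrono::steady_clock::now();
        float sum = 0.f;
        for (std::size_t i : order)
            sum += data[i];
        const auto stop = std::chrono::steady_clock::now();
        std::cout << "sum = " << sum << ", took "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count()
                  << " ms\n";
    };

    traverse(indices); // linear walk: cache/prefetcher friendly

    std::shuffle(indices.begin(), indices.end(), std::mt19937{42});
    traverse(indices); // scattered walk over the same data: far more cache misses
}
```

Both runs touch exactly the same bytes and do exactly the same arithmetic; only the access order differs, and that alone is usually enough to show a clear gap.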
This is all assuming that those draw commands are actually sent to the GPU verbatim, such as when using glMultiDrawElementsIndirect. Never underestimate how much work the driver actually ends up doing for you. It might very well be the case that a specific implementation even goes ahead and splits your single multi-draw call up into multiple batches because it estimates the GPU will execute them faster.
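For reference, this is roughly what such a single-call submission looks like at the GL level. It is only a sketch under the assumption of an OpenGL 4.3+ context with a loader such as glad, an already-configured VAO with vertex and element buffers bound, and a pre-created buffer object for the commands; submitBatches is a made-up helper name, not part of any library:

```cpp
#include <glad/glad.h>
#include <vector>

// Layout mandated by the GL spec for indirect indexed draws.
struct DrawElementsIndirectCommand
{
    GLuint count;         // number of indices for this sub-draw
    GLuint instanceCount; // usually 1
    GLuint firstIndex;    // offset into the element buffer
    GLint  baseVertex;    // added to each index
    GLuint baseInstance;  // for instanced attributes
};

void submitBatches(GLuint indirectBuffer, const std::vector<DrawElementsIndirectCommand>& cmds)
{
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
    glBufferData(GL_DRAW_INDIRECT_BUFFER,
                 static_cast<GLsizeiptr>(cmds.size() * sizeof(DrawElementsIndirectCommand)),
                 cmds.data(), GL_DYNAMIC_DRAW);

    // One API call, many draws. Whether the GPU consumes this list as-is or the driver
    // quietly splits it up again is entirely implementation-specific.
    glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                                nullptr, static_cast<GLsizei>(cmds.size()), 0);
}
```

The nullptr offset and the stride of 0 simply mean "start at the beginning of the bound indirect buffer" and "the commands are tightly packed".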
Memory constraints are obviously a big concern as well; however, for my particular needs in this project, which again may be strictly specific to me, my solution seemed to make more economical use of memory than the basic implementation.
Did you measure this and determine that this is a real problem? Do you have real-world numbers that speak for themselves? And I'm not talking about a single isolated primitive and its usage, I'm talking about your software/library put to use in a... real scenario.
additional instances of VertexArray, which, as a C++ class with several functions and members such as std::vector, does come with its own overhead that could add up depending on how many calls need to be made
Member functions don't take up per-instance memory, and an sf::VertexArray consumes, depending on the system, somewhere between 12 and 32 bytes each. Even with 1,000,000 of them, that's 12-32 MB of memory "overhead", which should be barely noticeable compared to the actual vertex data. Sending an sf::VertexArray off to sf::RenderTarget to be rendered does take time, but if your batches aren't too small (which should usually be the case in any typical scenario where using sf::VertexArray is even worth considering), then this time is amortized by the rest of the time spent doing the actual work.
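If you want to check the per-instance footprint on your own system, a quick sanity check along these lines (assuming SFML is installed and linked) will print the raw object sizes; the exact numbers vary by platform, standard library and SFML version:

```cpp
#include <SFML/Graphics.hpp>
#include <iostream>
#include <vector>

int main()
{
    // Size of the container object itself, not the vertex data it owns on the heap.
    std::cout << "sizeof(sf::VertexArray):         " << sizeof(sf::VertexArray) << " bytes\n";
    std::cout << "sizeof(std::vector<sf::Vertex>): " << sizeof(std::vector<sf::Vertex>) << " bytes\n";
    // Each stored vertex (position + color + texture coordinates) costs this much on top.
    std::cout << "sizeof(sf::Vertex):              " << sizeof(sf::Vertex) << " bytes\n";
}
```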
Did you try benchmarking/profiling/analyzing memory usage with and without your optimization? Because if the difference is less than 5% on average in a non-synthetic scenario, you should really consider whether the time you put into it was actually worth it.