You need to convert the audio into volume data rather than amplitude/waveform data. That is, you need to know how loud your sound is at a particular point in time. To do this, you take a block of samples around that time (either some before and some after, or all before, especially if the audio is 'live'/streamed) and calculate the volume of that block. Volume here means perceived volume or loudness, not the sound's overall volume control (sound.setVolume is not what you want here).
A simple way to do this is to average the absolute height of the wave over a block of time:
(semi-pseudo code):
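double total{ 0.0 }; // floating point, so the divisions below don't truncate to 0
auto currentSampleIndex{ getSampleIndexForCurrentTime() };
for (auto i{ 0 }; i < blockSize; ++i)
{
    total += std::abs(sample[currentSampleIndex - i]); // assumes currentSampleIndex >= blockSize - 1
}
total /= blockSize; // average absolute sample value
total /= 32768; // scale 16-bit samples to 0 - 1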
This should give you something in the range of 0 - 1 that represents the overall volume of that block. Note that, in practice, the maximum should be around 0.5 (for reference, a full-scale sine wave averages 2/π, roughly 0.64) unless the sound has an extremely high or extremely low frequency at extremely high volume, or a high DC offset. You will have to determine the block size yourself; you can try 100, 200, 1000, 10000 or so, but be aware that the larger the block, the more work it has to do.
A block size of 10000 on a sound with a sample rate of 44.1 kHz is less than a quarter of a second of sound (10000 / 44100 ≈ 0.23 s), so this may be an acceptable size.
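For reference, here is a self-contained sketch of the averaging above. It assumes 16-bit signed mono samples held in a std::vector; the names blockVolume and blockDurationSeconds are made up for illustration and are not part of any library. Unlike the semi-pseudo code, it shortens the block near the start of the buffer so the index never goes negative.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper: mean absolute amplitude of the block of samples that
// ends at currentSampleIndex (inclusive), scaled to roughly 0..1.
// Assumes 16-bit signed mono samples.
double blockVolume(const std::vector<std::int16_t>& samples,
                   std::size_t currentSampleIndex,
                   std::size_t blockSize)
{
    if (samples.empty() || blockSize == 0)
        return 0.0;

    // Clamp to the last valid sample, then shorten the block if it would
    // reach back before the start of the buffer.
    const std::size_t last =
        currentSampleIndex < samples.size() ? currentSampleIndex : samples.size() - 1;
    const std::size_t count = blockSize < last + 1 ? blockSize : last + 1;

    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i)
        total += std::abs(static_cast<int>(samples[last - i]));

    // Average, then scale by the 16-bit full-scale value.
    return total / static_cast<double>(count) / 32768.0;
}

// Length of a block in seconds for a given sample rate (mono).
double blockDurationSeconds(std::size_t blockSize, unsigned sampleRate)
{
    return static_cast<double>(blockSize) / static_cast<double>(sampleRate);
}
```

You would call blockVolume once per update with the sample index for the current playback time. If your data is interleaved stereo, the same number of samples covers half as much time, so you may want to double the block size or average each channel separately.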
For more accurate results, you could convert to sound units, or convert the sound to frequencies and add extra "weight" to the lower frequencies.
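As one possible reading of "sound units", the sketch below takes the RMS level of a block and expresses it in decibels relative to full scale. Treat this as an assumption about what would suit your case rather than the definitive approach: the names rmsLevel and toDecibels are hypothetical, it again assumes 16-bit signed samples, and it does not attempt the frequency weighting.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: root-mean-square level of count samples starting at
// index first, scaled to 0..1. Assumes 16-bit signed samples.
// Returns 0 if the block is empty or runs past the end of the buffer.
double rmsLevel(const std::vector<std::int16_t>& samples,
                std::size_t first,
                std::size_t count)
{
    if (count == 0 || first + count > samples.size())
        return 0.0;

    double sumOfSquares = 0.0;
    for (std::size_t i = 0; i < count; ++i)
    {
        const double s = samples[first + i] / 32768.0;
        sumOfSquares += s * s;
    }
    return std::sqrt(sumOfSquares / static_cast<double>(count));
}

// Convert a linear level (0..1) to decibels relative to full scale.
// The level is floored at a small value so silence does not give -infinity.
double toDecibels(double level)
{
    const double floorLevel = 1e-5; // -100 dB, an arbitrary choice
    return 20.0 * std::log10(level > floorLevel ? level : floorLevel);
}
```

A level of 0.5 comes out at about -6 dB, and each further halving lowers it by roughly another 6 dB, which tends to track how loud something seems better than the raw 0 - 1 value does.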