This is a controversial question that is intended to stir up a discussion about what SFML actually is, not an actual feature request.
Let's define what a medium is. I believe media are any data that can be consumed by a human. For example, music is a medium because it can be parsed by a human, and visual art is a medium. Sculpture is a form of visual art and is its own kind of medium if you are permitted to touch it. Naively, since humans only have 6 senses through which to experience the world (sight, smell, taste, touch, hearing and time), media can be broken up into 6 categories.
But human experience is wider than the 6 physical sensations! There are internal sensations, like love and hate. So it would be more precise to say that there are 2 classes of media, physical and nonphysical, and 6 classes of physical media. Multimedia just means multiple media at once; for example, a movie is multimedia. In fact, art can be multimedia if it causes you to feel something nonphysical, like wistfulness.
Naturally, it is not possible to cause a person to feel an emotion without exposing them to some kind of physical stimulus. Therefore, all arts that can be classified as nonphysical media are also physical media as a matter of course.
This brings me to SFML, which stands for Simple and Fast Multimedia Library. Media here obviously refers to physical media. With the current state of the art there are barely over 3 physical media that can be conveyed through the computer: sight, sound and time obviously, and perhaps a limited tactile experience through vibrating rumble controllers or touch screens. So our ability to convey nonphysical media is restricted to what we can invoke through half of the human physical senses!
Despite this overwhelming drawback, compelling arts that invoke an emotional reaction in the consumer have been actualized through computers, and libraries like SFML enable artists to make these experiences happen.
If you take a good look at SFML's documentation you will see that everything operates on media in some form, whether it be storing the media or retrieving it, sending it across a network, parsing it, or sending it to a device that will make it perceptible to the human. Not all operations are visible to the person, and a great deal is hidden in order to provide the programmer with a simple interface through which to convey their art.
SFML enables programmers to convey physical media so that the consumers of that physical media will experience a nonphysical response, like enjoyment, anger, or even something more cerebral like understanding, as is the case with most non-entertainment multimedia applications.
One question I would like you to think about is whether SFML would be better or worse off combining media and treating the combinations as first-class entities. With barely more than 3 media through which to convey information to the consumer via the computer, there are only 7 combinations: {{sound, sight, time}, {sound, time}, {sight, time}, {sound, sight}, {sound}, {sight}, {time}}
Currently SFML's API revolves around using composition of the last 3 to build the other 4. What are the benefits and drawbacks of this approach?
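For anyone who wants to check the count of 7, here is a minimal C++ sketch that enumerates every non-empty combination of the three media with a bitmask. The names `kMedia` and `nonEmptyCombinations` are made up for illustration; nothing here is part of SFML's API.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// The three media the post says a computer can convey (illustrative names only).
const std::array<std::string, 3> kMedia = {"sound", "sight", "time"};

// Returns every non-empty combination of kMedia, one string per combination.
// Bit i of `mask` decides whether kMedia[i] is part of the combination.
std::vector<std::string> nonEmptyCombinations() {
    std::vector<std::string> result;
    const unsigned total = 1u << kMedia.size();  // 2^3 = 8 subsets, counting the empty one
    // Walk the masks from "all three" down to 1, skipping 0 (the empty set).
    for (unsigned mask = total - 1; mask >= 1; --mask) {
        std::string combo;
        for (std::size_t i = 0; i < kMedia.size(); ++i) {
            if (mask & (1u << i)) {
                if (!combo.empty()) combo += ' ';
                combo += kMedia[i];
            }
        }
        result.push_back(combo);
    }
    return result;
}
```

Calling `nonEmptyCombinations()` yields 2^3 - 1 = 7 entries, starting with all three media together and ending with a single medium. A fourth medium, such as rumble feedback, would bring the total to 15, which hints at how quickly a "first-class combinations" design would grow compared with composing single media.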