Accessible Audio Settings for Film and Television
Sound mixes for film and television don't suck, but your playback situation probably does. Let's find a compromise.
I’m going to say probably the most controversial thing someone who has worked in audio post-production can say: viewers should be able to mix their own audio levels on streaming services.
Why is this controversial? Well, sound mixing is one of the many crafts that go into film and television creation. There is a lot of subtlety to a well-mixed film: things that might not always be noticed on a casual listen, such as a slightly louder room tone that creates some tension in a scene, or purposefully making the music in a bar or club just a bit louder than the voice. It’s not just a craft, it’s an art. So, if we let viewers change the art of the mix, are we ruining the art of film and television? I don’t think so, for a number of reasons. I’ll explain, but first, we have to understand some of the technical aspects behind mixing audio for film and television.
An audio mix happens in a carefully calibrated environment, usually measured to a standard like Dolby Atmos. For films, this usually happens in a large mixing stage that basically looks like a movie theatre. The idea is that a film mixed to a particular theatre standard will sound the same at theatres around the world that also follow that standard. That makes sense: consistency in reproduction is a good thing.
For television, shows are often mixed in smaller rooms, but these rooms are still calibrated and measured to be accurate audio monitoring environments. These are “ideal” listening scenarios. I can understand film being mixed with the theatre in mind, but television (especially in the age of streaming services) is going to be listened to on a million devices in a million environments, with different speaker set-ups, by different brands, and so on. How do you mix something to sound good on both a studio-grade 5.1 Surround Sound setup and on laptop speakers?
Further complicating this is the fact that with a 5.1 mix, the dialogue is almost always located in the centre channel. It benefits from being relatively isolated from other sound elements in a mix. However, when it’s mixed down for 2-channel stereo playback, the dialogue is technically competing for space with all the other sounds, such as ambience, foley, music, explosions, etc. And yes, even 1-channel mono compatibility is still something to factor in, even in today’s world, as I learned the hard way last spring.
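To make the stereo fold-down a bit more concrete, here is a rough sketch of what happens to one sample of a 5.1 mix. The gains are assumptions on my part: roughly -3 dB for the centre and surrounds, in the style of the ITU-R BS.775 downmix, with the LFE channel dropped. Real encoders and devices may use different values, and the function name and sample numbers are made up for illustration.

```python
import math

# Assumed downmix gain of about -3 dB (1/sqrt(2)) for centre and surrounds.
MINUS_3_DB = 1 / math.sqrt(2)

def downmix_51_to_stereo(frame):
    """Fold one 5.1 sample frame (L, R, C, LFE, Ls, Rs) down to (left, right)."""
    l, r, c, lfe, ls, rs = frame
    # The LFE channel is often left out of stereo downmixes; assumed here.
    left = l + MINUS_3_DB * c + MINUS_3_DB * ls
    right = r + MINUS_3_DB * c + MINUS_3_DB * rs
    return left, right

# Dialogue sitting alone in the centre channel.
dialogue_only = (0.0, 0.0, 0.5, 0.0, 0.0, 0.0)
# The same dialogue with music in the front pair and effects in the surrounds.
busy_scene = (0.4, 0.4, 0.5, 0.3, 0.2, 0.2)

print(downmix_51_to_stereo(dialogue_only))
print(downmix_51_to_stereo(busy_scene))
```

In the busy scene, the dialogue is no longer on its own speaker: it is summed into the same two channels as everything else, which is the "competing for space" problem in miniature.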
How do sound mixers account for all these setups? Well, they might listen on a secondary set of speakers that emulate lower-quality setups. The famous examples that come to mind are the Yamaha NS-10Ms and the Auratone 5C Sound Cube (lovingly called the “shit box” for its tonal quality). For some perspective, this is also done in music mixing, where one of the historical benchmarks was listening to your mix in the car. If your mix sounds good enough in the worst circumstances and great in the best, then you should be good to go. But in the end, not all mixes will translate well, which, in film and TV, leads some viewers to proclaim that the mix is terrible (spoiler: the mix isn’t terrible, their setup is). So I think it’s time we give viewers some level of control over the audio mix.
What made me change my mind about all of this was video games. In recent years, video games have started offering unprecedented control over a variety of audio and visual parameters that allow players to enjoy the game, whatever their circumstances are. These are often, but not exclusively, located under accessibility settings. Things like colour-blind modes, which alter certain colours to be easier to differentiate, or adding textures/shapes to certain in-game objects that are colour-coded. There have also been vast improvements in subtitle/closed-caption options, which aren’t just useful for those who are hard of hearing, but also for people who want to experience the game in its native language when it’s their second or third language. And of course, there are volume sliders for audio.
At their simplest, volume sliders can be broken up into three main categories: music, sound effects, and dialogue. It’s possible to get more granular with audio as well. For example, are character “barks” dialogue or sound effects? These are the sounds that come from non-player characters[1] around the in-game world, like “They’re shooting at us, take cover,” or “Fresh fish, we catch ‘em, you buy ‘em.” An ally telling you important information is different from a random merchant trying to sell you fish, so how do you categorize them without providing 50 options for the player to choose from? I for one would like to see even more audio options from games, but I digress.
Different games also mix their audio in very different ways. Games like the recent God of War series (Santa Monica Studio) and Elden Ring (FromSoftware) have relatively quiet dialogue with extremely loud and impactful SFX in combat. Part of this makes for a very dynamic soundscape, which is exciting, whereas a game like Celeste (EXOK) is a lot flatter by comparison. That doesn’t mean Celeste sounds bad. Quite the contrary, it’s a very well-mixed game. It just means that there are different styles of mix.
[Embedded example: typical moments in Elden Ring, including UI sounds, menu-ing, dialogue, ambience, attacks, music, and more]
[Embedded example: a typical gameplay moment in Celeste, with music, walking, jumping, death, and other sound effects]
Likewise, film and television can have very different mix styles. A Marvel superhero flick is going to have greater dynamic range than a three-camera sitcom. And that’s where watching a movie in theatres shines: it’s a dynamic and exciting audiovisual spectacle. Some people find certain mixes to be a bit low, but overall, dynamic range has a place in cinema. However, watching films or TV shows on phones, laptops, iPads, TVs with or without sound systems, and so on, is a very different experience. And there are a number of reasons why viewers might want to turn down the SFX: a child might be sleeping in the other room, or the viewer might be hard of hearing but not necessarily need closed captions to follow the dialogue.
Finally, I think it’s important to understand that there is an obvious technical reason this was never done for film or broadcast television: there’s no easy way for viewers to control the levels.[2] But in the age of streaming, it would be quite simple to add a few volume sliders for music, SFX, and dialogue. Streaming services already provide users a way to select different language audio dubs, so they would merely need to add an option for music and SFX, with volume sliders for each.
It’s not even extra work for sound mixers, as most high-level productions produce what are called “M&E” tracks, which are isolated music and effects (SFX) tracks for international dubs (usually with additional foley to replace the sounds that were recorded on set along with the original-language dialogue and would otherwise be absent in dubs). The sound mixers have already done the work of isolating the audio tracks, and the technology should be readily available on most streaming platforms if they choose to implement it.[3] Really, it’s an accessibility feature that should just be built in, like closed captions.
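For a sense of how little is involved on the playback side, here is a minimal sketch of three volume sliders applied to separate dialogue, music, and SFX stems. Everything in it is hypothetical: the function names, the decibel-based sliders, the tiny sample lists, and the hard clipping at the end (a real player would more likely use a limiter) are my own assumptions, not how any streaming service actually does it.

```python
def db_to_gain(db):
    """Convert a slider value in decibels to a linear amplitude multiplier."""
    return 10 ** (db / 20)

def mix_stems(dialogue, music, sfx, dialogue_db=0.0, music_db=0.0, sfx_db=0.0):
    """Sum three equal-length mono stems sample by sample with per-stem gain."""
    gd = db_to_gain(dialogue_db)
    gm = db_to_gain(music_db)
    gs = db_to_gain(sfx_db)
    mixed = [gd * d + gm * m + gs * s for d, m, s in zip(dialogue, music, sfx)]
    # Hard-clip to [-1, 1] so boosted stems cannot exceed full scale.
    return [max(-1.0, min(1.0, x)) for x in mixed]

# Made-up sample values standing in for real audio stems.
dialogue = [0.2, 0.3, 0.1]
music = [0.1, 0.1, 0.1]
sfx = [0.5, 0.6, 0.4]

# All sliders at 0 dB reproduces the mix as delivered.
default_mix = mix_stems(dialogue, music, sfx)
# A "child sleeping next door" setting: dialogue up a little, SFX well down.
quiet_night = mix_stems(dialogue, music, sfx, dialogue_db=3.0, sfx_db=-12.0)

print(default_mix)
print(quiet_night)
```

With every slider left at 0 dB, the output is just the sum of the stems, so the default stays the mix as intended and the sliders only matter for viewers who choose to move them.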
Will this ruin the art of sound mixing? No. In a world where streaming services let you play back videos at 2x speed, a few tweaks to audio levels won’t destroy the film. The developers of Celeste made sure to let players know that the default options are how the game was designed to be played, but still made more accessible ways of playing possible. I believe that sound mixing is an art, but perhaps we too should let default audio levels be the intended design and allow people to make their own choices, according to their own needs.
[1] Technically, player-characters also have “barks” such as grunts and screams while jumping or attacking, but I wanted to keep the focus on NPCs in my example.
[2] Broadcast television does have the technology for additional audio streams, like descriptive audio, but there’s no way to independently control the volume for each audio stream.
[3] There could be some arguments about the bandwidth needed for streaming services to send 18 channels of audio (3 × 6 channels, for 5.1 Surround Sound) instead of just 6 for a 5.1 mix. But I think there are ways to work around that, such as stereo-only options for regular-tier subscribers, 5.1 mix options for higher-tier subscribers, etc.