All that’s true, but those factors affect all the audio similarly. The article specifically talks about server-side ad insertion, so this isn’t a case where the player somehow uses the device’s .mov codec to play the content and an MP3 codec to play the ad. All ffmpeg (most likely) knows is that it’s decoding one long stream; it doesn’t switch audio pipelines mid-stream when it thinks it might be playing an ad at that moment.

Regarding the perceptual volume differences: while true, that’s also a solvable problem. Output volumes can be calculated using standard curves. In any case, TV broadcasters have had to figure all this out years ago.

> those factors affect all the audio similarly... Output volumes can be calculated using standard curves... TV broadcasters have had to figure all this out years ago.

Sorry, but all of that is obtuse. The fact that some digital audio can be perceived as much louder than other audio, even though it's all limited to the same digital range, proves they aren't similar at all.

There is no such thing as a standard curve for compression. Source levels vary almost infinitely. Accurately separating and reducing sound after the fact, without turning the whole thing to mud, is considered to be an impossible technical challenge.

Next, TV broadcasters worked on a predetermined schedule with predetermined advertising. This gave them time to inspect and approve ads in advance.

Streaming ads are generally served just in time from third-party services to the streaming host. ffmpeg gets the output from the stream host, but the host has to combine content from multiple sources (entertainment + multiple ad servers) into that single stream. Currently, the sound level is completely at the whim of each ad server, as well as each ad producer. Meanwhile, the final output is at the whim of the streaming host: 24-hour-news streaming sites probably have different audio standards than Apple TV+.

Ultimately, AI could potentially be used to solve it, since it can generate (make up) new sounds as part of reverse-compression. But it would still have to be done in advance by the third-party ad servers.

None of this is true. There are standard curves for human hearing frequency response and you can use these to compare sound A’s volume to sound B. And since sound compression is in DCT space, you can calculate those numbers very quickly with something similar to sum(vol(f) * curve(f) for f in encoded_frequencies).
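To make that sum concrete, here's a rough Python sketch. It assumes A-weighting as the "standard curve" (my choice for illustration; broadcast loudness standards such as ITU-R BS.1770 use a different weighting on decoded audio) and assumes you already have (frequency, power) pairs from somewhere, e.g. the codec's frequency-domain coefficients. The names `a_weight_db` and `weighted_level_db` are made up for this example.

```python
import math

def a_weight_db(freq_hz: float) -> float:
    """A-weighting gain in dB at freq_hz (standard analytic form, ~0 dB at 1 kHz)."""
    f2 = freq_hz ** 2
    num = (12194.0 ** 2) * f2 ** 2
    den = ((f2 + 20.6 ** 2)
           * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    return 20.0 * math.log10(num / den) + 2.0

def weighted_level_db(bins) -> float:
    """bins: iterable of (freq_hz, linear_power) pairs.

    Returns the perceptually weighted level in dB, i.e. the
    sum(vol(f) * curve(f)) from the comment, with the curve
    converted from dB to a linear power factor.
    """
    total = sum(power * 10 ** (a_weight_db(freq) / 10)
                for freq, power in bins if freq > 0)
    return 10 * math.log10(total) if total > 0 else float("-inf")

# Same power, different frequency: the 100 Hz tone scores as much quieter.
low_tone = weighted_level_db([(100.0, 1.0)])
mid_tone = weighted_level_db([(1000.0, 1.0)])
```

Comparing `weighted_level_db(ad_bins)` against `weighted_level_db(content_bins)` gives a single number for "how much louder does A sound than B", which is all the comparison needs.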

I read the article. It specifically talks about server-side ad embedding, i.e. where the service is inserting ad content into the streams, and therefore, by definition, has access to the ad content. They can do the calculations on their end during the embedding process and normalize volumes there before transmitting the result. To make things even easier, they don’t have to calculate the ad volume each time one’s streamed, just once per ad they’re going to serve.

And finally, all of this is a solved problem for TV broadcasters. They face the same problems: advertisers send them content to air, then the broadcasters are legally required to normalize the ad vs content volume, and they do. If this is an insurmountable problem that the streaming services face, they can drive over to their nearest TV station and ask them how they manage to pull off this technological feat.