This is impressive, though perhaps not very useful. Humans (and animals in general) are quite bad at precisely locating sound anyway. We only have two input channels, the right and the left ear, and any location information comes from a signal difference (usually loudness) between the two.

Localization of sound is primarily based on the time difference between the ears. Localization is also pretty precise, to within a few degrees under good conditions.

Nit: time difference, phase difference, amplitude difference, and head-related transfer function (HRTF) are all involved. Different cues dominate at different frequencies.
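
To put a rough number on the time-difference cue, here's a quick sketch using the classic Woodworth spherical-head approximation; the 8.75 cm head radius and 343 m/s speed of sound are just typical assumed values, not anything from the comments above:

    import math

    def itd_woodworth(azimuth_deg, head_radius_m=0.0875, speed_of_sound_m_s=343.0):
        # Approximate interaural time difference (seconds) for a source at the
        # given azimuth (0 = straight ahead, 90 = directly to one side), using
        # the Woodworth spherical-head model: ITD = (a / c) * (theta + sin(theta)).
        theta = math.radians(azimuth_deg)
        return (head_radius_m / speed_of_sound_m_s) * (theta + math.sin(theta))

    # A source 30 degrees off-center arrives roughly 0.26 ms earlier at the near ear.
    print(round(itd_woodworth(30) * 1e6), "microseconds")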

There's this excellent (German?) website that lets you play around with and understand these via demos. I'll see if I can find it.

Edit: found it, it’s https://www.audiocheck.net/audiotests_stereophonicsound.php

I think for stereo sound, media like music, TV, movies and video games use loudness difference instead of time difference to indicate location.

In music, simple panning works okay, but never exceeds the stereo base of a speaker arrangement. For a truly immersive listener experience, audio engineers always employ timing differences and separate spectral treatments of the stereo channels, HRTF being the cutting edge of that.
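
For reference, the "simple panning" above usually means a constant-power amplitude pan. A minimal sketch using the common -3 dB sine/cosine pan law (one convention among several):

    import math

    def constant_power_pan(pan):
        # pan: -1.0 = hard left, 0.0 = center, +1.0 = hard right.
        # Returns (left_gain, right_gain); the squared gains always sum to 1,
        # so perceived loudness stays roughly constant across the stereo base.
        angle = (pan + 1.0) * math.pi / 4.0   # map [-1, 1] onto [0, pi/2]
        return math.cos(angle), math.sin(angle)

    left, right = constant_power_pan(0.0)    # centered: both gains ~0.707 (-3 dB)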

Atmos as used in cinema rooms is, as far as I know, amplitude based (probably VBAP), and it is impressive and immersive. Immersion depends more on the number and placement of loudspeakers. Some systems do use Ambisonics, which can encode time differences as well, at least from microphone recordings.
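
For anyone curious what VBAP actually computes: the source's direction vector is expressed in the basis of the active loudspeaker pair (a triplet in 3D), and the resulting coefficients become the gains. A bare-bones 2D sketch; the ±30 degree loudspeaker angles are just illustrative:

    import numpy as np

    def vbap_2d(source_azimuth_deg, spk_azimuths_deg):
        # Gains for one loudspeaker pair via 2D vector base amplitude panning:
        # express the source's unit vector in the basis formed by the two
        # loudspeaker unit vectors, then normalize the gains to constant power.
        def unit(deg):
            rad = np.radians(deg)
            return np.array([np.cos(rad), np.sin(rad)])

        basis = np.column_stack([unit(a) for a in spk_azimuths_deg])  # 2x2 basis
        gains = np.linalg.solve(basis, unit(source_azimuth_deg))
        return gains / np.linalg.norm(gains)

    # A source at 10 degrees panned between loudspeakers at -30 and +30 degrees
    # gets more gain on the nearer (+30) loudspeaker.
    print(vbap_2d(10, [-30, 30]))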

HRTF as used in binaural synthesis is for headphones only, not relevant here.

This is true, but a high density of loudspeakers allows the use of Wave Field Synthesis, which recreates a full physical sound field where all three cues can be used.

At least video games use way more complex models for that, AFAIK. It might be tricky to apply to mixes of recorded media, so loudness is commonly used there.

Unreal Engine, the engine I'm most familiar with, implements VBAP for panning of 3D moving sources, which is just amplitude panning when played through loudspeakers. It also allows Ambisonics recordings for ambient sound, which are then decoded into 7.1.

For headphone-based spatialization (binaural synthesis), virtual Ambisonics fed into HRTF convolution is usually used, which is not amplitude based; height in particular is encoded using spectral filtering.
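
For a rough picture of the Ambisonics part: a mono source is encoded into spherical-harmonic channels that carry its direction, which a decoder then maps either to real loudspeakers or to virtual loudspeakers convolved with HRTFs. A minimal first-order sketch; the FuMa-style -3 dB weighting on W is one convention among several, and the virtual-loudspeaker/HRTF step is omitted:

    import numpy as np

    def encode_foa(mono, azimuth_deg, elevation_deg):
        # Encode a mono signal into traditional first-order B-format (W, X, Y, Z).
        # ACN/SN3D systems order and scale the channels differently.
        az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
        w = mono / np.sqrt(2.0)              # omnidirectional component
        x = mono * np.cos(az) * np.cos(el)   # front/back figure-of-eight
        y = mono * np.sin(az) * np.cos(el)   # left/right
        z = mono * np.sin(el)                # up/down
        return np.stack([w, x, y, z])

    # A 1 kHz tone placed 45 degrees to the left, 20 degrees up.
    tone = np.sin(2 * np.pi * 1000 * np.arange(4800) / 48000)
    bformat = encode_foa(tone, azimuth_deg=45, elevation_deg=20)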

So loudspeakers -> mostly amplitude based, headphones -> not amplitude based.

Which makes sense; there is only so much you can do with loudspeakers to affect the perceived location, since you don't really know where the loudspeakers and the listener are located relative to each other.

Actually, the farther away the speakers are from the angles specified in the 7.1 format (see https://www.dolby.com/about/support/guide/speaker-setup-guid...), the worse the localization accuracy will be. And if the person is not sitting centered relative to the loudspeakers but closer to one of them, localization can completely collapse, and it will sound like the sound comes only from the closest loudspeaker.

In the case of gamers, they are usually centered relative to the loudspeakers, and the loudspeakers tend to be placed symmetrically around the computer screen, so the problem is not so bad.

For viewers sitting in a cinema the problem is much worse; most of the audience is off center. That is why 7.1 has a center loudspeaker: the dialogue is sent directly there to make sure that at least the dialogue comes from the right direction.

I'm sorry, but this is not accurate at all. Using "only" two signals, humans are quite good at localizing sound sources in some directions:

> Concerning absolute localization, in frontal position, peak accuracy is observed at 1–2 degrees for localization in the horizontal plane and 3–4 degrees for localization in the vertical plane (Makous and Middlebrooks, 1990; Grothe et al., 2010; Tabry et al., 2013).

from https://www.frontiersin.org/journals/psychology/articles/10....

Humans are quite good at estimating distance too, inside rooms.

Humans use three cues for localization: time differences, amplitude differences, and spectral cues from the outer ears, head, torso, etc. They also use slight head movements to disambiguate sources where the signal differences would otherwise be the same (front and back, for instance).

I do agree that humans would not perceive the location difference between two pixels next to each other.

As I wrote elsewhere:

> Yet I'm usually not even noticing whether a video has stereo or mono sound. So I highly doubt that ultra precise OLED loudspeakers would make a noticeable difference.

Yep, hearing is more akin to a hologram than to mere stereo-pair imaging.

You are misinformed.

Amplitude, spectral, and timing cues are all integrated into a positional / distance probability mapping. Humans can estimate the vector of a sound by about 2 degrees horizontal and 4 degrees vertical. Distance is also pretty accurate, especially in a room, where direct and reflected sounds arrive at different times, creating interference patterns.

The brain processes audio in a way not too dissimilar from the way that medical imaging scanners can use a small number of sensors to develop a detailed 3d image.

In a perfectly dark room, you can feel large objects by the void they make in the acoustic space of ambient noise and reflected sounds from your own body.

Interestingly, the shape of the ear is such that different phase shifts occur for front and rear positions of reflected and conducted sounds, further improving localization.

We often underestimate the information richness of the sonic sensome, as most live in a culture that deeply favors the visual environment, but some subcultures and also indigenous cultures have learned to more fully explore those sensory spaces.

People of the extreme northern latitudes may spend a much larger percentage of their waking hours in darkness or overwhelming white environments and learn to rely more on sound to sense their surroundings.

I learned to move around in dark rooms when I was young. I definitely can "feel large objects by the void they make", and people often turn on the lights because they think I need them to "see" when I really don't.

> Humans can estimate the vector of a sound by about 2 degrees horizontal and 4 degrees vertical.

Yet I'm usually not even noticing whether a video has stereo or mono sound. So I highly doubt that ultra precise OLED loudspeakers would make a noticeable difference.

The main utility isn't for the user to more precisely locate the sound source within the screen. Phased speaker arrays allow emitting sound in controlled directions, even sending multiple sound channels in different directions at the same time.
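
To make "controlled directions" concrete: a phased array steers by delaying each element so the wavefronts add coherently along the target direction and partially cancel elsewhere. A minimal delay-and-sum sketch for a linear array; the element count, 2 cm spacing, and 48 kHz sample rate are made-up example values:

    import numpy as np

    def steering_delays(n_elements, spacing_m, steer_deg, c=343.0):
        # Per-element delays (seconds) that steer a linear array's main lobe
        # toward steer_deg (0 = broadside). Shifted so the earliest element
        # has zero delay.
        positions = np.arange(n_elements) * spacing_m
        delays = positions * np.sin(np.radians(steer_deg)) / c
        return delays - delays.min()

    def delay_and_sum(signal, delays, sample_rate):
        # Drive every element with the same signal, delayed per steering_delays.
        # Rows are per-element drive signals; summed in the air, they reinforce
        # along the steered direction and partially cancel elsewhere.
        max_shift = int(round(delays.max() * sample_rate))
        out = np.zeros((len(delays), len(signal) + max_shift))
        for i, d in enumerate(delays):
            shift = int(round(d * sample_rate))
            out[i, shift:shift + len(signal)] = signal
        return out

    delays = steering_delays(n_elements=8, spacing_m=0.02, steer_deg=25)
    tone = np.sin(2 * np.pi * 1000 * np.arange(4800) / 48000)
    drive = delay_and_sum(tone, delays, sample_rate=48000)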