So you are saying we get displays that could run wavefield synthesis?

If you don't know what wavefield synthesis is: you basically have an array of evenly spaced speakers and for each virtual sound source you drive each individual speaker with a specially delayed signal that recreates the wavefield a sound source would create if it occupied that space. This is basically as close as you can get to the thing being in actual space.

Of course the amount of delay lines and processing needed is exorbitant, and for a screen the limiting factor is the physical dimension of the thing, but if you can create high-resolution 2D loudspeaker arrays that glow, you can also create ones that do not.

Because audio runs at such low sample rates, it's not exorbitant by current standards. Suppose you have an 8×16 array of speakers behind your screen, each running at 96ksps. That's only 12.3 million samples per second to generate, on the order of 200 million multiply-accumulates per second for even the most extreme scenarios. Lattice's iCE40UP5K FPGA https://www.farnell.com/datasheets/3215488.pdf#page=10 contains 8 "sysDSP" blocks which can do two 8-bit multiply-accumulates per clock cycle at 50MHz even when pipelining is disabled, so 800 million per second. It's 2.1 by 2.5 mm and costs US$5 at Digi-Key. I'm not familiar with AVX, but I believe each core of your four-core 2GHz AVX512 CPU can do 256 8-bit multiply-accumulates per cycle, five hundred thousand million per second per core, so we're talking about an exorbitant amount of computation that's 0.04% of a single core.
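The arithmetic above can be checked in a few lines. Note the 16 multiply-accumulates per sample is my own generous assumption about the per-sample work; only the array size and sample rate come from the text:

```python
# Back-of-envelope check of the 8x16 / 96 ksps numbers.
speakers = 8 * 16                 # speakers in the array
rate = 96_000                     # samples per second per speaker
samples_per_s = speakers * rate
print(samples_per_s)              # 12_288_000, i.e. ~12.3 million

macs_per_sample = 16              # assumed per-sample work (taps per speaker)
macs_per_s = samples_per_s * macs_per_sample
print(macs_per_s)                 # 196_608_000, "on the order of 200 million"
```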

I wasn't thinking of a 16x8 array, I was thinking of a 160x80 array. Spacing your speakers too closely has diminishing returns, but it depends on the frequency up to which you want to operate. If we assume a top frequency of 20kHz, you should space your speakers at half the wavelength to avoid spatial aliasing artifacts, so something like a speaker every 8mm. This is especially important if the listener position is close.
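The half-wavelength spacing rule is a one-liner; the 343 m/s speed of sound is an assumed value for room-temperature air:

```python
# Half-wavelength speaker spacing to avoid spatial aliasing.
c = 343.0                    # speed of sound in air, m/s (assumed)
f_max = 20_000               # highest frequency to reproduce, Hz
spacing = c / f_max / 2      # half the wavelength at f_max, in metres
print(round(spacing * 1000, 1))   # 8.6 (mm), matching "every 8mm"
```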

This means about 13000 precise delay lines, multiplied by the number of virtual sound sources you want to allow for at the same time; let's assume 64. At a sampling rate of 48kHz that means 39×10⁹ samples per second. That isn't nothing, especially for a consumer device, and especially if we assume the delay values for each of the virtual source-speaker combinations need to be adjusted on the fly.
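A quick sketch of both the throughput figure and the per-pair delay that would have to be recomputed on the fly. The `delay_samples` helper is hypothetical, just to show what each of the roughly 800k source-speaker pairs needs:

```python
import math

c = 343.0        # speed of sound in air, m/s (assumed)
rate = 48_000    # sample rate, Hz

def delay_samples(src, spk):
    """Fractional delay, in samples, from a virtual source to one speaker.
    src and spk are (x, y, z) positions in metres. Hypothetical helper."""
    return math.dist(src, spk) * rate / c

# Throughput for the 160x80 array with 64 simultaneous sources:
speakers = 160 * 80              # 12_800 delay lines
sources = 64
print(speakers * sources * rate) # 39_321_600_000 samples/s, i.e. ~39e9
```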

Hmm, I see. I think that you can cheat quite a bit more than that, though, if your objective is only to fool human hearing (as your 48ksps and 20kHz numbers suggest): the humans can only use phase information to detect the directionality of sound up to roughly a kilohertz, relying mostly on amplitude differences above that, presumably because their neurons run too slow. But maybe your objective is sonar beamforming or something.
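One way to see where the phase cue breaks down: once half a wavelength fits between the ears, interaural phase becomes ambiguous. The ~0.18 m ear spacing here is an assumed typical value, and this is only a crude geometric bound, not a psychoacoustic measurement:

```python
# Frequency above which interaural phase is ambiguous (rough sketch).
c = 343.0             # speed of sound in air, m/s (assumed)
ear_spacing = 0.18    # approximate distance between the ears, m (assumed)
f_ambiguous = c / (2 * ear_spacing)
print(round(f_ambiguous))   # ~953 Hz, on the order of a kilohertz
```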