The real killer is that humans don't hear frequencies, they hear instruments, which are stacks of frequencies that only sometimes correlate with a single frequency range.
I wonder if transformer tech is close to achieving real-time audio decoding, where you can split a track into its component instruments and drive a light show off of that. Think of those fancy Christmastime front-yard light shows, as opposed to random colors blinking with what might be a beat.
Real-time audio stem separation is already possible; some models can even get down to around 20 ms of latency (HS-TasNet) https://github.com/lucidrains/HS-TasNet
There was also a nice overview paper last year, https://arxiv.org/html/2511.13146v1, which introduced RT-STT; it's still being tweaked and built upon in the MSS scene.
The high-quality models like MDXNet and Demucs usually have at least several seconds of latency, but for something like driving visuals, top quality isn't really needed and the real-time approaches should be fine.
I'm pretty sure it should be possible to distill HS-TasNet into an approximation fast enough for animating LEDs.
In the end, it's "just" chunking streamed audio into windows and predicting which LEDs a window should activate. One could build a complex non-real-time pipeline, generate high-quality training data with it, and then train a much smaller model (maybe even an MLP) on that data to predict just this one task.
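A minimal sketch of what that distilled inference loop could look like, assuming a tiny MLP over per-window frequency-band energies. Everything here is hypothetical: the feature extraction, window size, band count, LED count, and network shape are placeholder choices, and the random weights stand in for weights that would actually come from training on labels produced by the offline high-quality pipeline.

```python
import numpy as np

def band_energies(window, n_bands=16):
    """Log-power energy in evenly split frequency bands of one audio window."""
    spec = np.abs(np.fft.rfft(window * np.hanning(len(window)))) ** 2
    bands = np.array_split(spec, n_bands)
    return np.log1p(np.array([b.sum() for b in bands]))

class TinyMLP:
    """One-hidden-layer MLP mapping band energies to per-LED brightness.
    Weights are random placeholders; in practice they'd be distilled from
    the non-real-time pipeline's outputs."""
    def __init__(self, n_in=16, n_hidden=32, n_leds=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.normal(0, 0.1, (n_hidden, n_leds))
        self.b2 = np.zeros(n_leds)

    def __call__(self, x):
        h = np.tanh(x @ self.w1 + self.b1)
        # Sigmoid squashes each LED's activation into [0, 1]
        return 1 / (1 + np.exp(-(h @ self.w2 + self.b2)))

# Stream simulation: ~23 ms windows at 44.1 kHz
sr, win = 44100, 1024
audio = np.random.default_rng(1).normal(size=sr)  # 1 s of noise as stand-in audio
model = TinyMLP()
for start in range(0, len(audio) - win, win):
    leds = model(band_energies(audio[start:start + win]))
    # `leds` is an array of 8 brightness values to push to the LED controller
```

At 1024 samples per window the model gets a fresh prediction roughly every 23 ms, which is in the same ballpark as the HS-TasNet latency mentioned above.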