I might be mistaken, but I don't see how this is novel. As far as I know, this has been a proven DSP technique for ages, although it is usually only applied when a small number of distinct frequencies needs to be detected - for example DTMF.
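
The usual tool for that DTMF-style case is the Goertzel algorithm: one second-order recursion per frequency of interest, evaluating a single DFT bin. A minimal sketch (the block size and frequencies are just the standard DTMF values, used for illustration):

    import math

    def goertzel_power(samples, target_freq, sample_rate):
        # Goertzel algorithm: power of a single DFT bin via a 2nd-order recursion.
        n = len(samples)
        k = round(n * target_freq / sample_rate)      # nearest bin index
        coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
        s_prev, s_prev2 = 0.0, 0.0
        for x in samples:
            s = x + coeff * s_prev - s_prev2
            s_prev2, s_prev = s_prev, s
        # Squared magnitude of bin k
        return s_prev**2 + s_prev2**2 - coeff * s_prev * s_prev2

    # Example: detect the 1209 Hz DTMF column tone in one analysis block
    sample_rate = 8000
    n = 205                                            # common DTMF block size
    tone = [math.sin(2 * math.pi * 1209 * i / sample_rate) for i in range(n)]
    print(goertzel_power(tone, 1209, sample_rate))     # large vs. other DTMF tones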

When the number of frequencies/bins grows, it is computationally much cheaper to use the well-known FFT algorithm instead, at the price of having to handle the input data in blocks instead of "streaming".
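
For contrast, a block-based FFT delivers every bin of the block at once for O(N log N) work; a small numpy sketch (the tone frequencies are arbitrary):

    import numpy as np

    sr = 8000
    t = np.arange(sr) / sr
    x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1209 * t)

    block = x[:1024]                                  # one block of input
    spectrum = np.fft.rfft(block * np.hanning(1024))  # all 513 bins in one go
    freqs = np.fft.rfftfreq(1024, d=1 / sr)
    print(freqs[np.argmax(np.abs(spectrum))])         # strongest bin, ~440 Hz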

The difference from the FFT is that this is a multiresolution technique, like the constant-Q transform. And, unlike the CQT (which is noncausal), it provides a better match to the actual behavior of our ears by being causal. It's also "fast" in the FFT sense (which the CQT is not).
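
Concretely, "multiresolution" here means the analysis window scales with the period of the frequency being analysed, which is the constant-Q idea: every bin keeps the same ratio of center frequency to bandwidth, so low bins get long windows (fine frequency, coarse time) and high bins get short ones. A sketch of the standard CQT window-length rule (the specific parameters are arbitrary):

    import numpy as np

    sr = 22050
    bins_per_octave = 12
    fmin = 32.70                      # C1, an arbitrary starting pitch
    n_bins = 84                       # 7 octaves

    Q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1)          # constant quality factor
    freqs = fmin * 2 ** (np.arange(n_bins) / bins_per_octave)
    win_lengths = np.ceil(Q * sr / freqs).astype(int)      # window per bin

    print(win_lengths[0], win_lengths[-1])   # long window at fmin, short at fmax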

There is the multiresolution FFT, and other forms of FFT based around sliding-window/SFFT techniques. The CQT can also be implemented extremely quickly, using FFTs and kernels or other methods, like in the librosa library (dubbed pseudo-CQT).
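
For reference, here is roughly what that looks like in librosa: the exact CQT (FFT-accelerated, but computed over a whole buffered signal) next to the cheaper pseudo-CQT approximation. The example clip and parameters are only illustrative:

    import numpy as np
    import librosa

    # Any mono signal works; librosa ships a few example clips
    y, sr = librosa.load(librosa.ex('trumpet'))

    # Exact CQT: complex-valued, 84 bins spanning 7 octaves at 12 bins/octave
    C = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))

    # Pseudo-CQT: a single STFT projected onto the same log-frequency bins;
    # faster and magnitude-only, with coarser frequency resolution at the low end
    P = librosa.pseudo_cqt(y, sr=sr, n_bins=84, bins_per_octave=12)

    print(C.shape, P.shape)   # both (84, n_frames)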

I'm also not sure how this is causal. It has a weighted time window (biasing the more recent sound), which is fairly novel, but I wouldn't call that causal.
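
For what it's worth, one common form of a recency-weighted analysis window (not necessarily what the submission actually does) is an exponentially forgetting DFT bin, which by construction only ever sees past and present samples; a minimal sketch:

    import numpy as np

    def leaky_dft_bin(x, f, sr, r=0.999):
        # At step n this holds sum_k x[n-k] * r**k * exp(-2j*pi*f*k/sr):
        # a correlation with a complex exponential at f in which a sample
        # k steps in the past is down-weighted by r**k.
        w = r * np.exp(-2j * np.pi * f / sr)
        acc = 0.0 + 0.0j
        out = np.empty(len(x), dtype=complex)
        for n, sample in enumerate(x):
            acc = w * acc + sample      # recursive update: decayed past + new sample
            out[n] = acc
        return out

    # Example: track a 440 Hz tone that switches on halfway through
    sr = 8000
    t = np.arange(2 * sr) / sr
    x = np.where(t >= 1.0, np.sin(2 * np.pi * 440 * t), 0.0)
    mag = np.abs(leaky_dft_bin(x, 440, sr))   # rises once the tone starts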

This is not to say I don't think this is cool; it certainly looks better than existing techniques like synchrosqueezing for pushing the limit of the Heisenberg uncertainty principle (technically, given ideal conditions, synchrosqueezing can outperform the principle, but only for a specific subset of signals).