A lot of wifi sensing results that have high-dimensional outputs are usually using wideband links... your average wifi connection uses 20MHz of bandwidth and is transmitting on 48 spaced out frequencies. In the paper, we use 160MHz with effectively 1992 input data points. This still isn't enough to predict a 3x512x512 image well enough, which motivated predicting 4x64x64 latent embeddings instead.

The more space you take up in the frequency domain, the higher your resolution in the time domain is. Wifi sensing results that detect heart rate or breathing, for example, use even larger bandwidth, to the point where it'd be more accurate to call them radars than wifi access points.