Yeah, this seems too insane to be true. I understand that wifi signal strength etc. is heavily affected by the contents of a room, but even so it seems far-fetched that there is enough information in its distortion to produce these results.

Wifi sensing results with high-dimensional outputs usually rely on wideband links. Your average wifi connection uses 20MHz of bandwidth and transmits on 48 spaced-out frequencies. In the paper, we use 160MHz with effectively 1992 input data points. That still isn't enough to predict a 3x512x512 image well, which motivated predicting 4x64x64 latent embeddings instead.
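
To put rough numbers on that mismatch, here is a back-of-the-envelope sketch. It only counts values using the figures quoted above (1992 inputs, 3x512x512 image, 4x64x64 latent); the names are made up for illustration, and it assumes each "input data point" counts as a single value, which may not match how the paper represents them.

```python
from math import prod

# Figures quoted in the comment above; everything else is illustrative.
CSI_INPUTS = 1992             # "effectively 1992 input data points" at 160MHz
IMAGE_SHAPE = (3, 512, 512)   # full RGB image
LATENT_SHAPE = (4, 64, 64)    # latent embedding predicted instead


def outputs_per_input(shape, n_inputs=CSI_INPUTS):
    """How many output values each input value would have to account for."""
    return prod(shape) / n_inputs


image_ratio = outputs_per_input(IMAGE_SHAPE)
latent_ratio = outputs_per_input(LATENT_SHAPE)

print(f"image:  {prod(IMAGE_SHAPE)} values, {image_ratio:.1f} per input")
print(f"latent: {prod(LATENT_SHAPE)} values, {latent_ratio:.1f} per input")
```

Predicting the latent cuts the number of output values by a factor of 48, though that is still several values per input, so this is only a counting argument and says nothing about how much information the channel measurements actually carry.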

The more bandwidth you occupy in the frequency domain, the finer your resolution in the time domain: the smallest delay difference you can resolve scales roughly as 1/bandwidth. Wifi sensing results that detect heart rate or breathing, for example, use even larger bandwidth, to the point where it'd be more accurate to call them radars than wifi access points.
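
As a rough illustration of the bandwidth/time-resolution tradeoff, here is a small sketch using the common rule of thumb that the resolvable delay is about 1/bandwidth. The function names are hypothetical, and real systems differ by constant factors (windowing, antenna geometry, monostatic radar using c/2B for range), so treat the numbers as order-of-magnitude only.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0


def delay_resolution_s(bandwidth_hz):
    """Rule of thumb: two paths are separable if their delays differ by ~1/B."""
    return 1.0 / bandwidth_hz


def path_resolution_m(bandwidth_hz):
    """Path-length difference corresponding to that delay resolution."""
    return SPEED_OF_LIGHT_M_S * delay_resolution_s(bandwidth_hz)


for bw_mhz in (20, 160):
    bw_hz = bw_mhz * 1e6
    print(
        f"{bw_mhz:>3} MHz: "
        f"~{delay_resolution_s(bw_hz) * 1e9:.2f} ns, "
        f"~{path_resolution_m(bw_hz):.2f} m of path length"
    )
```

Under this approximation, a 20MHz link can only separate reflections whose path lengths differ by about 15 m, while 160MHz brings that down to a bit under 2 m.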