FYI, the images are not generated from the WiFi data alone. The WiFi data is used as additional conditioning for an ordinary diffusion image-generation model. In other words, the WiFi measurements determine which objects get placed where in the image, and the diffusion model then fills in any "knowledge gaps" with randomly generated (but visually plausible) detail.
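If it helps to picture the "additional conditioning" part mechanically, here's a generic sketch (the shapes and names are mine, not the paper's): the CSI measurement is projected into an embedding the denoiser sees alongside its usual conditioning, so it can constrain what goes where while the rest of the model supplies everything it doesn't determine.

```python
import torch
import torch.nn as nn

class CSIConditioning(nn.Module):
    """Illustrative only: project raw WiFi CSI features into the same space
    as the denoiser's usual conditioning tokens."""
    def __init__(self, csi_dim=256, cond_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(csi_dim, cond_dim),
            nn.SiLU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, csi):                  # csi: (batch, csi_dim)
        return self.proj(csi).unsqueeze(1)   # (batch, 1, cond_dim): one extra token

# A denoiser that attends to [text_tokens ; csi_token] gets its layout hints
# from the CSI token; anything the token doesn't pin down has to come from
# whatever the rest of the model has already learned.
csi_token = CSIConditioning()(torch.randn(1, 256))
print(csi_token.shape)  # torch.Size([1, 1, 768])
```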

I'm confused about how it gets things like the floor colour and clothing colour correct.

It seems like they might be giving it more information besides the WiFi data, or else maybe training it on photos of the actual person in the actual room, in which case it's not obvious how well it would generalise.

> I'm confused about how it gets things like the floor colour and clothing colour correct.

The model was trained on the room.

It would produce images of the room even without any WiFi data input at all.

The WiFi is used as a modulator on the input to the pre-trained model.

It’s not actually generating an image of the room from only WiFi signals.
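A rough sketch of what that could look like at inference time, assuming a Stable Diffusion img2img setup via diffusers (the model name, prompt, and strength below are my guesses, not taken from the paper): the frozen pretrained pipeline does the visual work, and a CSI-derived latent only picks the starting point.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

# Off-the-shelf pretrained pipeline: this supplies the generic visual prior.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

@torch.no_grad()
def csi_to_image(csi_encoder, csi_sample, prompt="a person standing in a room"):
    # Hypothetical csi_encoder: trained on photos of this specific room, it maps
    # one CSI measurement to an SD-sized latent of shape (1, 4, 64, 64).
    rough_latent = csi_encoder(csi_sample)
    rough = pipe.vae.decode(rough_latent / pipe.vae.config.scaling_factor).sample
    rough_pil = pipe.image_processor.postprocess(rough, output_type="pil")[0]

    # Low-strength img2img: keep the layout implied by the CSI-derived latent,
    # let the diffusion prior invent plausible colour/texture for the rest.
    return pipe(prompt=prompt, image=rough_pil, strength=0.3).images[0]
```

A real implementation could denoise the predicted latent directly instead of taking the decode/re-encode detour; the PIL round trip just keeps the sketch short. And because the room-specific knowledge lives in whatever was trained on photos of that room, feeding in garbage CSI would still tend to produce an image of the room, just with the wrong layout.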

This is what GP alludes to: the original dataset has many similar reference images (i.e. the common mode is the same), and the LatentCSI model is tasked with reconstructing the correct specific instance (or a similarly plausible image in the case of the test/validation set).

It wouldn't generalize at all. The Wi-Fi is just differentiating among a small set of possible object placements/orientations within that fixed space, and then modifying photos taken there accordingly, as far as I can tell.

Think of it as an img2img stable diffusion process, except instead of starting with an image you want to transform, you start with CSI.

The encoder itself is trained on latent embeddings of images of the same environment with the same subject, so it learns visual details, but only those that survive the original autoencoder's compression (which is why the model can't overfit on, say, text or faces).
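A minimal sketch of that training setup (the loss and shapes are my assumptions; `vae` stands for the frozen image autoencoder of the latent diffusion model, e.g. diffusers' AutoencoderKL): the targets are latents, not pixels, so anything the autoencoder discards is simply not there to be memorised.

```python
import math
import torch
import torch.nn as nn

class CSIEncoder(nn.Module):
    """Hypothetical: map one CSI measurement to a Stable-Diffusion-sized latent."""
    def __init__(self, csi_dim=512, latent_shape=(4, 64, 64)):
        super().__init__()
        self.latent_shape = latent_shape
        self.net = nn.Sequential(
            nn.Linear(csi_dim, 1024),
            nn.SiLU(),
            nn.Linear(1024, math.prod(latent_shape)),
        )

    def forward(self, csi):                               # csi: (B, csi_dim)
        return self.net(csi).view(-1, *self.latent_shape)

def train_step(encoder, vae, optimizer, csi_batch, photo_batch):
    """csi_batch: CSI captured in the room; photo_batch: photos taken at the
    same moments, (B, 3, 512, 512) scaled to [-1, 1]."""
    with torch.no_grad():
        # The target is the frozen autoencoder's latent, not the raw pixels.
        target = vae.encode(photo_batch).latent_dist.mean
    pred = encoder(csi_batch)
    loss = nn.functional.mse_loss(pred, target)           # regress CSI -> latent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since a 512x512 photo collapses to a 4x64x64 latent, fine detail like small text or a specific face mostly isn't in the target to begin with, which is the point about not being able to overfit on it.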