This is largely guesswork, but I think what's happening is this. The training set contains images of a small number of rooms, taken from specific camera angles with only one individual standing in each, along with the associated wifi signal data. The model then learns to predict the posture of the individual from the wifi signal data, outputting the prediction as a colour image. Since the background doesn't vary across images, the model learns to reproduce it consistently, with accurate colours etc.
The interesting part of the whole setup is that the wifi signal seems to contain the information required to predict the posture of the individual to a reasonably high degree of accuracy, which is actually pretty cool.
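To make the background point concrete, here is a toy sketch. It is not the actual model, which I haven't seen. It is a plain least-squares fit on made-up data. Everything in it (the feature count, the 64-pixel "image", `pose_map`, the choice of which pixels hold the person) is an assumption for illustration. The only thing it shows is that pixels which never change in training get reproduced exactly, whatever the input signal is.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_pixels = 200, 16, 64

# Hypothetical stand-in for wifi measurements: one feature vector per frame.
signals = rng.normal(size=(n_samples, n_features))

# Fixed background, identical in every training image.
background = rng.uniform(0.0, 1.0, size=n_pixels)

# Assumption: pixels 0..15 are where the person appears, and their
# values depend (here, linearly) on the signal.
person_pixels = np.arange(16)
background_pixels = np.arange(16, n_pixels)
pose_map = rng.normal(scale=0.1, size=(n_features, person_pixels.size))

images = np.tile(background, (n_samples, 1))
images[:, person_pixels] += signals @ pose_map

# Least-squares fit of image = [signal, 1] @ weights (last row is the bias).
design = np.hstack([signals, np.ones((n_samples, 1))])
weights, *_ = np.linalg.lstsq(design, images, rcond=None)

def predict(signal):
    """Predict a flattened image from one signal vector."""
    return np.append(signal, 1.0) @ weights

# A signal the model has never seen still yields the same background.
new_signal = rng.normal(size=n_features)
predicted = predict(new_signal)
background_error = np.abs(
    predicted[background_pixels] - background[background_pixels]
).max()
print("max background error:", background_error)
```

In this toy version the background ends up entirely in the bias term and the signal weights for those pixels are essentially zero, so the "room" comes out the same every time and only the person pixels respond to the signal.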