Think of it as an img2img stable diffusion process, except instead of starting with an image you want to transform, you start with CSI.
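To make the analogy concrete, here is a minimal sketch of that loop. `vae`, `unet`, `scheduler`, and `csi_encoder` are hypothetical stand-ins for the usual Stable Diffusion components plus a learned CSI-to-latent encoder; none of these names come from the actual implementation.

```python
import torch

@torch.no_grad()
def generate_from_csi(csi_window, csi_encoder, unet, scheduler, vae, strength=0.6):
    # img2img normally starts from vae.encode(init_image); here the
    # starting latent is predicted directly from the CSI window instead.
    latent = csi_encoder(csi_window)                      # rough latent guess

    # Noise the latent partway, exactly as img2img does with `strength`.
    timesteps = scheduler.timesteps
    start = int(len(timesteps) * (1 - strength))
    noise = torch.randn_like(latent)
    latent = scheduler.add_noise(latent, noise, timesteps[start])

    # Denoise from that point on with the ordinary diffusion loop.
    for t in timesteps[start:]:
        noise_pred = unet(latent, t)
        latent = scheduler.step(noise_pred, t, latent).prev_sample

    return vae.decode(latent)                             # back to pixel space
```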

The encoder itself is trained on latent embeddings of images of the same subject in the same environment, so it learns only the visual details that survive the original autoencoder's compression (this is why the model can't overfit on fine-grained content like text or faces).
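A hedged sketch of that training step follows. The encoder is fit to the frozen autoencoder's latents of paired images, so anything the autoencoder discards is simply absent from the training signal; `csi_encoder`, `vae`, and the (CSI, image) pairs here are illustrative stand-ins, not the actual training code.

```python
import torch
import torch.nn.functional as F

def train_step(csi_encoder, vae, optimizer, csi_batch, image_batch):
    with torch.no_grad():
        # Target is the latent of the paired image under the frozen VAE;
        # detail the VAE loses (fine text, faces) never reaches the loss.
        target_latent = vae.encode(image_batch)

    pred_latent = csi_encoder(csi_batch)       # predict the latent from CSI
    loss = F.mse_loss(pred_latent, target_latent)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```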