If it sees the shape of a fire extinguisher, the diffusion model will "know" it should be red. But that's not all that's going on here. Things like hair color seem impossible to guess, right? To be fair, I haven't actually read the paper, so maybe they explain this.
downvoted until you read the paper