Modern systems like Nano Banana 2 and ChatGPT Images 2.0 are very close to "just use Photoshop directly" in concept, if not in execution.
They seem to use an agentic LLM with image inputs and outputs to produce, verify, refine, and compose visual artifacts. Those operations appear to be learned functions, however, rather than calls to an external tool like Photoshop.
This allows for "variable depth" in practice: composition takes previous images as inputs, and each of those may itself have been generated from scratch or composed from still earlier images.
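
To make the "variable depth" idea concrete, here is a minimal Python sketch. It is only an illustration of the recursive structure described above, not a claim about how either system is implemented; `ImageNode` and `depth` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ImageNode:
    """Hypothetical record of one image in a composition history."""
    label: str
    # Images this one was composed from; empty means "generated from scratch".
    inputs: List["ImageNode"] = field(default_factory=list)


def depth(node: ImageNode) -> int:
    """Count the generation steps on the longest chain ending at this image."""
    if not node.inputs:
        return 1
    return 1 + max(depth(parent) for parent in node.inputs)


# Two images made from scratch, one refinement, then a final composite.
background = ImageNode("background")
subject = ImageNode("subject")
refined_subject = ImageNode("refined subject", [subject])
final = ImageNode("final composite", [background, refined_subject])

print(depth(background), depth(refined_subject), depth(final))
```

The depth is not fixed in advance: it depends on how many times intermediate images are fed back in before the final composite.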