So here's my understanding of the current native image generation scenario. I might be wrong, so please correct me; I'm still learning this and would appreciate the help.

If I'm not wrong, native image gen was first introduced in Gemini 2.0 Flash, and then OpenAI released it for 4o, which took over the internet with Ghibli art.

We've been getting good-quality images from almost all image generators, like Midjourney, OpenAI, and other providers, but what made this special was its true "multimodal" nature. Here's what I mean:

When you asked ChatGPT to create an image, it would rephrase your prompt and internally send it to DALL-E; similarly, Gemini would send it to Imagen. Both were diffusion models, and they had little to no context in your next response about what was in the previous image.
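
Roughly, the old flow looked like this (a minimal sketch with made-up function names, not anyone's real API):

```python
# Hypothetical sketch of the old "tool call" flow (names are illustrative,
# not OpenAI's or Google's actual APIs): the chat model hands only a short
# rewritten prompt to a separate diffusion model, so the image model never
# sees the rest of the conversation.

def chat_model_rewrite(conversation: list[str]) -> str:
    # The chat model condenses the user's request into a standalone prompt.
    return "A watercolor painting of a cat wearing a space helmet"

def diffusion_model_generate(prompt: str) -> bytes:
    # Stand-in for DALL-E / Imagen: it sees ONLY this prompt string,
    # not the conversation history or any previously generated image.
    return f"<image generated for: {prompt}>".encode()

conversation = [
    "user: draw me a cat astronaut",
    "assistant: sure, generating...",
]
prompt = chat_model_rewrite(conversation)
image = diffusion_model_generate(prompt)
# A follow-up like "make the helmet red" means another rewritten prompt;
# the image model has no memory of what it produced last time.
```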

In native image generation, the same model understands audio, text, and even image tokens, and doesn't need to rely on a separate diffusion model internally. I don't think OpenAI or Google has released how they trained it, but my guess is that it's partially autoregressive and partially diffusion; I'm not sure about it.
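
To make that concrete, here's a toy sketch of what "image tokens in the same model" could mean. The vocabulary sizes, offset scheme, and fake token ids are pure assumptions on my part, not how OpenAI or Google actually do it:

```python
# Toy illustration: text tokens and discrete image tokens share ONE vocabulary
# and ONE autoregressive sequence, so later turns can attend directly to the
# tokens of an earlier image instead of re-prompting a separate model.

TEXT_VOCAB_SIZE = 50_000
IMAGE_CODEBOOK_SIZE = 8_192           # e.g. codes from a VQ-style image tokenizer
IMAGE_TOKEN_OFFSET = TEXT_VOCAB_SIZE  # image codes mapped after the text ids

def image_to_tokens(image_codes: list[int]) -> list[int]:
    # Map discrete image codes into the shared vocabulary.
    return [IMAGE_TOKEN_OFFSET + c for c in image_codes]

# One interleaved sequence: prompt text, then image tokens, then more text.
sequence = (
    [101, 2057, 3021]                      # "draw a cat" (fake text ids)
    + image_to_tokens([5, 77, 1032, 409])  # tokens of the generated image
    + [101, 2435, 912]                     # "now make it red" (fake text ids)
)
# Because the image is just more tokens in the same context window,
# the model can condition the next image directly on the previous one.
print(sequence)
```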

I think 4o image generation in ChatGPT is actually still a tool call with a prompt to an "image_gen" tool; I don't think the generator receives the full context of the conversation. If you do a ChatGPT data export and inspect the record of a conversation that used 4o image gen, you'll see it's a tool call with a distinct prompt, much like it was with DALL-E. And if you pass an image in as context, it'll pass that to the tool as well.
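
If you want to check this yourself, something like the snippet below will surface the tool calls in an export. The field names ("mapping", "recipient", and so on) reflect how the export looked when I poked at it, so treat them as assumptions; the schema may have changed:

```python
# Scan a ChatGPT data export for messages addressed to a tool rather than
# to the user ("all"); image generation shows up here with its own prompt.
import json

with open("conversations.json") as f:
    conversations = json.load(f)

for convo in conversations:
    for node in convo.get("mapping", {}).values():
        msg = node.get("message") or {}
        recipient = msg.get("recipient", "all")
        role = (msg.get("author") or {}).get("role")
        if recipient != "all" or role == "tool":
            print(recipient, role, str(msg.get("content"))[:120])
```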

I imagine this is for anti-jailbreak moderation reasons, which is understandable.

This is not fully correct.

The people behind Flux are the authors of the Stable Diffusion paper, which dates back to 2022.

OpenAI initially had DALL-E, but Stable Diffusion was a massive improvement on DALL-E.

Then OpenAI took inspiration from Stable Diffusion for GPT Image.