> much harder to train than transformers
There are plenty of GANs that use transformers. Papers With Code (PWC) seems to be redirecting to GitHub at the moment, but IIRC about half of the top scores on FFHQ-256 were GANs with transformers in them. I know the number-2 entry was; I saw it at CVPR. It was a lot smaller and had higher throughput than the diffusion models it was outperforming.

Though the main reason diffusion took over was its ability to capture more diversity. I still think there's a place for GANs and that we overcorrected by putting so much focus on diffusion, but diffusion does have a lot of fundamental advantages. It isn't strictly better, though; there's no global optimum for solution spaces this large. The ML community (maybe CS in general) has a tendency to take an all-or-nothing approach, and I don't think that's a good strategy.

Thanks! Got any links, if you can spare the time? Even just the info on GANs using transformers might be enough. Wasn't aware of that!

Sure. This was the paper[0]. Here are a few more you might find interesting: Google's Transformer GAN[1] (not a transformer at every resolution), and Diffusion-GAN[2], which is a hybrid architecture. Remember that, technically, the GAN process can use any underlying architecture; arguably some of the training steps in LLMs are GANs (a rough sketch of that architecture-agnostic training loop is below the links). I think this one[3] is also interesting in a similar respect. Before PWC went down, StyleSAN[4] was the SOTA on FFHQ, but IIRC it doesn't change the architecture, so it should work with the other architectures too (it adds compute cost, though I think only at training time; it's been a while since I read it).

[0] https://arxiv.org/abs/2211.05770

[1] https://arxiv.org/abs/2106.07631

[2] https://arxiv.org/abs/2206.02262

[3] https://arxiv.org/abs/2212.04473

[4] https://arxiv.org/abs/2301.12811
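
To make the "a GAN is a training procedure, not an architecture" point concrete, here's a minimal PyTorch sketch of a standard adversarial training step where the generator and discriminator are just opaque nn.Modules, so either one could be a transformer, a CNN, or anything else with matching input/output shapes. The names, shapes, and loss choice here are my own illustration and not taken from any of the papers above.

```python
import torch
import torch.nn as nn

def gan_step(G: nn.Module, D: nn.Module,
             real: torch.Tensor,            # batch of real images
             opt_g: torch.optim.Optimizer,
             opt_d: torch.optim.Optimizer,
             latent_dim: int = 128):
    """One adversarial update. Assumes G: (b, latent_dim) -> image batch
    and D: image batch -> (b, 1) realness logits; the internals of G and D
    are irrelevant to the procedure."""
    bce = nn.BCEWithLogitsLoss()
    b = real.size(0)
    z = torch.randn(b, latent_dim, device=real.device)

    # Discriminator update: push real toward 1, generated toward 0.
    opt_d.zero_grad()
    fake = G(z).detach()                    # block gradients into G here
    d_loss = bce(D(real), torch.ones(b, 1, device=real.device)) + \
             bce(D(fake), torch.zeros(b, 1, device=real.device))
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make D label fresh samples as real.
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(b, 1, device=real.device))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

Nothing in that loop cares whether G or D is convolutional or attention-based, which is why transformer generators/discriminators and hybrids like Diffusion-GAN slot in without changing the adversarial objective itself.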

Thank you for taking the time to answer my questions.