> Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.
This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1].
I miss GANs. I understand they're much harder to train than transformers for the same performance, even in high-data, high-parameter regimes, but so much good optimization research and so many training tricks came out of them.
The work on the capacity of discriminators was super cool.
Though the main reason diffusion took over was its ability to capture more diversity. I still think there's a place for GANs and that we overcorrected by putting too much focus on diffusion, but diffusion does have real fundamental advantages. That said, it isn't strictly better; there's no global optimum for solution spaces this large. I think the ML community (maybe CS in general) has a tendency to take an all-or-nothing approach. I don't think that's a good strategy...
thanks! got any links if you can spare the time? i think the info on gans using transformers might be enough. wasn't aware!
Sure, this was the paper[0]. Here are a few more you might find interesting: Google's Transformer GAN[1] (not a transformer at all resolutions), and Diffusion-GAN[2], a hybrid architecture. Remember that technically the GAN process can use any underlying architecture; arguably, some of the training steps in LLMs are GANs. I think this one[3] is also interesting in a similar respect. Before Papers With Code went down, StyleSAN[4] was the SOTA on FFHQ. IIRC it doesn't change the architecture, so it should work with the other architectures too (it comes with extra compute costs, but I think only at training time; it's been a while since I read it).
[0] https://arxiv.org/abs/2211.05770
[1] https://arxiv.org/abs/2106.07631
[2] https://arxiv.org/abs/2206.02262
[3] https://arxiv.org/abs/2212.04473
[4] https://arxiv.org/abs/2301.12811
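To make the "any underlying architecture" point concrete, here's a toy sketch of my own (not from any of the papers above): the adversarial loop only assumes *some* generator and *some* discriminator you can take gradients through. Here both are deliberately trivial linear models with hand-derived gradients, learning to match a 1-D Gaussian; swap in a CNN, U-Net, or transformer and the loop itself doesn't change. The small weight decay on the discriminator is one of those stabilization tricks (it damps the oscillation around the equilibrium).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Generator G(z) = w*z + c, discriminator D(x) = sigmoid(a*x + b).
# Real data ~ N(3, 0.5); generator starts centered at 0.
w, c = 1.0, 0.0            # generator params
a, b = 0.0, 0.0            # discriminator params
target_mean, lr, batch = 3.0, 0.05, 64

for step in range(2000):
    z = rng.standard_normal(batch)
    x_fake = w * z + c
    x_real = target_mean + 0.5 * rng.standard_normal(batch)

    # Discriminator ascent on log D(real) + log(1 - D(fake)),
    # with a small weight decay on `a` to damp oscillations.
    d_real = sigmoid(a * x_real + b)
    d_fake = sigmoid(a * x_fake + b)
    a += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake)) - 0.01 * a
    b += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator ascent on the non-saturating loss log D(fake):
    # d/dx log D(x) = (1 - D(x)) * a, then chain rule through G.
    d_fake = sigmoid(a * x_fake + b)
    g_x = (1 - d_fake) * a
    w += lr * np.mean(g_x * z)
    c += lr * np.mean(g_x)

print(c)  # generator mean, pushed from 0 toward target_mean
```

Nothing in the two update steps inspects what G or D are internally; they only need samples and gradients, which is why the framework composes with transformers, diffusion denoisers, etc. (You can even watch `w` shrink over training here, a miniature version of variance collapse.)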
Thank you for taking the time to answer my questions.
> This seems to go beyond just transformers
And beyond that. Sometimes I feel like AI research is reinventing wheels that exist elsewhere. Maybe just the wheels, but still.