> Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.
This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1].
I miss GANs. I understand they're much harder to train than transformers for the same performance, even in high-data, high-parameter regimes, but so much good optimization research and so many training tricks came out of them.
The work on the capacity of discriminators was super cool.
Though the main reason diffusion took over was its ability to capture more diversity. I still think there's a place for GANs and that we overcorrected by putting too much focus on diffusion, but diffusion does have real fundamental advantages. That said, it isn't strictly better; there's no global optimum for solution spaces this large. I think the ML community (maybe CS in general) has a tendency to take an all-or-nothing approach. I don't think that's a good strategy...
thanks! got any links if you can spare the time? i think the info on gans using transformers might be enough. wasn't aware!
Sure, this was the paper[0]. Here are a few more you might find interesting: Google's Transformer GAN[1] (not a transformer at all resolutions), and Diffusion-GAN[2], a hybrid architecture. Remember that technically the GAN process can use any underlying architecture; arguably, some of the training steps in LLMs are GANs. I think this one[3] is also interesting in a similar respect. Before Papers With Code went down, StyleSAN[4] was the SOTA on FFHQ. IIRC it doesn't change the architecture, so it should work with the other architectures too (it comes with extra compute costs, but I think only at training time; it's been a while since I read it).
[0] https://arxiv.org/abs/2211.05770
[1] https://arxiv.org/abs/2106.07631
[2] https://arxiv.org/abs/2206.02262
[3] https://arxiv.org/abs/2212.04473
[4] https://arxiv.org/abs/2301.12811
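To make the "any underlying architecture" point concrete, here's a toy sketch of my own (not from any of the papers above): the adversarial loop only assumes *some* generator and *some* discriminator you can take gradients through. Here both are deliberately trivial linear models with hand-derived gradients, learning to match a 1-D Gaussian; swap in a CNN, U-Net, or transformer and the loop itself doesn't change. The small weight decay on the discriminator is one of those stabilization tricks (it damps the oscillation around the equilibrium).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Generator G(z) = w*z + c, discriminator D(x) = sigmoid(a*x + b).
# Real data ~ N(3, 0.5); generator starts centered at 0.
w, c = 1.0, 0.0            # generator params
a, b = 0.0, 0.0            # discriminator params
target_mean, lr, batch = 3.0, 0.05, 64

for step in range(2000):
    z = rng.standard_normal(batch)
    x_fake = w * z + c
    x_real = target_mean + 0.5 * rng.standard_normal(batch)

    # Discriminator ascent on log D(real) + log(1 - D(fake)),
    # with a small weight decay on `a` to damp oscillations.
    d_real = sigmoid(a * x_real + b)
    d_fake = sigmoid(a * x_fake + b)
    a += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * x_fake)) - 0.01 * a
    b += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator ascent on the non-saturating loss log D(fake):
    # d/dx log D(x) = (1 - D(x)) * a, then chain rule through G.
    d_fake = sigmoid(a * x_fake + b)
    g_x = (1 - d_fake) * a
    w += lr * np.mean(g_x * z)
    c += lr * np.mean(g_x)

print(c)  # generator mean, pushed from 0 toward target_mean
```

Nothing in the two update steps inspects what G or D are internally; they only need samples and gradients, which is why the framework composes with transformers, diffusion denoisers, etc. (You can even watch `w` shrink over training here, a miniature version of variance collapse.)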
Thank you for taking the time to answer my questions.
> This seems to go beyond just transformers
And beyond that. Sometimes I feel like AI research is reinventing wheels that exist elsewhere. Maybe just the wheels, but still.