CNNs are fine when trained with a good recipe. There are very few good studies comparing them against transformers with proper hyperparam search and all the training tricks applied consistently to both. Transformers are good, but ViT vs CNN is not some settled issue. Transformers are more hyped and more popular with the tech enthusiasts who just read forums and news, but if you need stuff done, CNNs are still great.
>Transformers are more hyped and more popular with the tech enthusiasts who just read forums and news, but if you need stuff done, CNNs are still great.
ViTs are straight up more popular for ML research now; it's not just 'tech enthusiasts'.
I agree, but since we're talking about image understanding with text output, clearly a CNN is unsuitable. My previous comment was overly reductive, and CNNs can still be SoTA depending on your performance metrics. I spent the earlier part of my career training CNNs, and they are very pleasant to work with.