CNNs excel at vision tasks where you have limited compute, limited memory, and limited data, and want something that works well and runs fast. People also don't usually hook CNNs up to a transformer to get language understanding; you have to train bespoke CNNs for specific tasks.
ViTs excel where you're effectively unconstrained on compute and data, and also want text understanding or the ability to have a conversation about an image.
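
To put a rough number on the compute point, here's a back-of-the-envelope sketch that counts multiply-accumulates for one 3x3 conv layer and one self-attention layer as the image gets bigger. The function names, the 64-channel conv, the 768-dim embedding, and the 16-pixel patches are all assumptions picked for illustration. It only shows how each cost scales with resolution; it is not a fair head-to-head between full models.

```python
def conv_macs(h, w, c_in, c_out, k=3):
    """MACs for one stride-1, same-padded k x k conv layer (hypothetical helper)."""
    # One k*k*c_in dot product per output pixel per output channel.
    return h * w * c_in * c_out * k * k


def attention_macs(h, w, dim, patch=16):
    """MACs for one single-head self-attention layer over image patches (hypothetical helper)."""
    n = (h // patch) * (w // patch)  # number of patch tokens
    proj = 4 * n * dim * dim         # Q, K, V, and output projections
    attn = 2 * n * n * dim           # Q @ K^T, then attention weights @ V
    return proj + attn


for side in (224, 448, 896):
    print(side, conv_macs(side, side, 64, 64), attention_macs(side, side, 768))
```

Doubling the image side multiplies the conv cost by 4, since it is linear in pixel count. The `attn` term grows by 16x because it is quadratic in the number of patches, which is one reason ViTs get expensive at high resolution.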