That sounds incredibly disappointing given how high their benchmark scores are. It suggests they might be overtuned for those benchmarks, similar to Llama 4.
Yeah, I think so too. They seemed to be better at specific tasks but worse overall on broader ones.