A couple of things:
I wasn't trying to invent anything. Just describing what you would obviously have to do if you were to take a "scientific" or "objective" approach: Sound experiments, reproducible, free of financial incentives.
As far as I can tell, no one is doing that at a significant scale. Everything is buried in hype and marketing.
Now for that broad set of benchmarks (PRs to GitHub, Putnam, Zelda). There is something to that, but it depends on the model. A lot of what is out there is a "mixture of experts", either by implicit or explicit design. So there is a mechanism that looks at the problem and then picks the subsystem to delegate it to. Is it a game of chess? Boot up the chess program. Is it poetry? Boot up the poetry generator.
That sort of thing doesn't show broad intelligence any more than a person who knows both a chess player and a poet has broad intelligence.
DeepSeek is, as far as I can tell, the leading open-source model, and in some ways that makes it the leading model. I don't think you can fairly compare a model that you can run locally with something running behind a server-side API, because who knows what is really going on behind the API.
DeepSeek being Chinese makes it political and even harder to have a sane conversation about, but I am sure that had it been China doing mostly closed models and the US doing open ones, we would hold that against them, big time.
> So there is a mechanism that looks at the problem and then picks the subsystem to delegate it to. Is it a game of chess - boot up the chess program? Is it poetry? Boot up the poetry generator.
No, that's not actually a good description of the mixture-of-experts methodology. It was poorly named. There is no conscious division of the weights into "This subset is good for poetry, this one is best for programming, this one for math, this one for games, this one for language translation, etc."
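To make that concrete, here is a toy sketch of how routing is commonly described for these layers: a small learned gate scores every expert for each individual token, the top few are kept, and their outputs are blended. Everything here is an assumption for illustration (the `ToyMoELayer` name, the sizes, the random weights standing in for trained ones, plain top-k softmax gating); it is not any particular model's implementation. The point is only that nothing in the code knows about "chess" or "poetry".

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class ToyMoELayer:
    """Hypothetical toy layer: a gate plus a handful of interchangeable experts."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Gate weights: one row per expert. Learned in a real model, random here.
        self.router = random_matrix(num_experts, dim, rng)
        # Each expert is just a dim x dim linear map; none is tied to a topic.
        self.experts = [random_matrix(dim, dim, rng) for _ in range(num_experts)]

    def route(self, token):
        """Pick the top_k experts for ONE token vector and their mixing weights."""
        scores = matvec(self.router, token)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        chosen = ranked[: self.top_k]
        weights = softmax([scores[i] for i in chosen])
        return chosen, weights

    def forward(self, token):
        """Weighted sum of the chosen experts' outputs for one token."""
        chosen, weights = self.route(token)
        out = [0.0] * len(token)
        for idx, w in zip(chosen, weights):
            expert_out = matvec(self.experts[idx], token)
            out = [o + w * e for o, e in zip(out, expert_out)]
        return out


layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
rng = random.Random(1)
# Five "tokens" from the same input; each gets its own routing decision.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
for t in tokens:
    chosen, weights = layer.route(t)
    print(chosen, [round(w, 3) for w in weights])
```

The routing decision is made per token (and, in a real network, again at every such layer), so a single sentence can touch many different experts. That is quite different from a dispatcher that reads the whole problem and hands it to a chess engine.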
> I wasn't trying to invent anything. Just describing what you would obviously have to do if you were to take a "scientific" or "objective" approach: Sound experiments, reproducible, free of financial incentives.
But how is that different from what arena or MathArena does?
> That sort of thing is not showing broad intelligence anymore than a person both knowing a chess player and a poet is having broad intelligence.
The claim is that these problems each require fairly broad intelligence on their own, as opposed to specialization in one specific task while being unable to do anything else.