Hacker News

I think there are so many variables, from harnesses to tasks, that it's very hard to rank the models in a pecking order unless one beats another on virtually every task (as with Opus vs DeepSeek).

But all in all, I don't think we disagree.