The only real world task benchmark I know of is Scale Labs RLI
https://labs.scale.com/leaderboard/rli
It's clear to me these models are useless on any real world task: a 4% pass rate on $20-30/hr Upwork tasks. This whole trend of agentic engineering is a giant money grab.
That list is missing some recent models, but most crucially, I think, the harness is fixed. One of the major learnings of the last few months is that the harness and eval (“looping” and the support and tooling around it) are really critical. I would guess these numbers are the floor.
For instance, some of these tasks include creating videos, and one commonly reported failure mode is truncated videos, or not all of the videos being created. This sort of failure is currently best managed by an outer evaluation loop; no frontier model, when managed by an eval loop, will submit work like this right now.
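To make "outer evaluation loop" concrete, here is a rough sketch of what I mean. Everything in it is hypothetical: the `Video` record, `check_deliverables`, `run_with_eval_loop`, and the `attempt` callable standing in for the agent are names I made up for illustration, not anything from the RLI harness. The idea is just that the deliverables get checked for missing or too-short videos, and the problems are fed back to the agent before anything is submitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Video:
    # Hypothetical record of one produced deliverable.
    name: str
    duration_s: float


def check_deliverables(videos: List[Video], expected_names: List[str],
                       min_duration_s: float) -> List[str]:
    """Return a list of problems: missing videos and videos that look truncated."""
    produced = {v.name: v for v in videos}
    problems = []
    for name in expected_names:
        video = produced.get(name)
        if video is None:
            problems.append(f"missing video: {name}")
        elif video.duration_s < min_duration_s:
            problems.append(
                f"truncated video: {name} ({video.duration_s}s < {min_duration_s}s)"
            )
    return problems


def run_with_eval_loop(attempt: Callable[[List[str]], List[Video]],
                       expected_names: List[str], min_duration_s: float,
                       max_rounds: int = 3) -> List[Video]:
    """Call the agent, check its output, and retry with feedback until it passes."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        videos = attempt(feedback)  # the agent sees the previous round's problems
        feedback = check_deliverables(videos, expected_names, min_duration_s)
        if not feedback:
            return videos  # only work that passes the checks gets submitted
    raise RuntimeError(f"still failing after {max_rounds} rounds: {feedback}")
```

A fixed harness that submits the first attempt never gets the chance to catch the truncated or missing file; the loop above is the smallest version of the thing that does.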
> these models are useless on any real world task
I beg to differ. They are not perfect, but they are immensely useful today.