Some interesting stats here about the current landscape: https://arena.ai/leaderboard/agent
Agent Arena (Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.)
Top 10, Highest rank to lowest
Claude Fable 5 (High), Claude Opus 4.8 (Thinking), GPT 5.5 (xHigh), Claude Opus 4.7 (Thinking), GPT 5.5 (High), Claude Opus 4.7, Claude Opus 4.6, GPT 5.5, GPT 5.4 (High), GLM 5.2 (Max)
Text Arena (View overall rankings across various AI models in text-to-text tasks across math, coding, creative writing, and other open-ended domains.)
Top 10, Highest rank to lowest
claude-fable-5, claude-opus-4-6-thinking, claude-opus-4-7-thinking, claude-opus-4-6, claude-opus-4-7, muse-spark, gemini-3.1-pro-preview, gemini-3-pro, claude-opus-4-8-thinking, gpt-5.5-high
The only real-world task benchmark I know of is Scale Labs RLI:
https://labs.scale.com/leaderboard/rli
It's clear to me these models are useless on any real world task: a 4% pass rate on $20-30/hr Upwork tasks. This whole trend of agentic engineering is a giant money grab.
That list is missing some recent models, but I think most crucially, the harness is fixed: one of the major learnings of the last few months is that the harness and eval (“looping” and the support / tooling around it) are really critical. I would guess these numbers are the floor.
For instance, some of these tasks include creating videos, and one of the commonly reported failure modes is truncated videos, or not all videos being created. This sort of failure mode is currently best managed by an outer evaluation loop; no frontier model will, when managed by an eval loop, submit work like this right now.
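To make "outer evaluation loop" concrete, here is a rough Python sketch of the idea, not how RLI or any particular harness actually works. Everything in it is hypothetical: `agent` is any callable that writes files into an output directory, `expected` is the list of deliverable names, and the file-size check is only a crude stand-in for a real truncation check (a real one would probably probe the video duration).

```python
from pathlib import Path
from typing import Callable, List, Tuple


def check_deliverables(out_dir: Path, expected: List[str], min_bytes: int = 1) -> List[str]:
    """Return a list of problems; an empty list means the work passed the check."""
    problems = []
    for name in expected:
        path = out_dir / name
        if not path.is_file():
            problems.append(f"missing: {name}")
        elif path.stat().st_size < min_bytes:
            # Crude proxy for "truncated"; an assumption, not a real video check.
            problems.append(f"too small (possibly truncated): {name}")
    return problems


def run_with_eval_loop(
    agent: Callable[[Path, List[str]], None],
    out_dir: Path,
    expected: List[str],
    max_attempts: int = 3,
    min_bytes: int = 1,
) -> Tuple[bool, int]:
    """Call the agent until its output passes the check or attempts run out.

    The problems found on the previous attempt are passed back as feedback.
    Returns (passed, attempts_used).
    """
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        agent(out_dir, feedback)
        feedback = check_deliverables(out_dir, expected, min_bytes)
        if not feedback:
            return True, attempt
    return False, max_attempts
```

The point is just that "not all videos were created" is cheap to detect outside the model, so the loop never submits until the check passes or it gives up.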
> these models are useless on any real world task
I beg to differ. They are not perfect, but they are immensely useful today.
there is no GPT 5.6 init, so what's the point?