I don’t see how you can make these claims without having your own evals and running these models yourself. The gpt-oss results I’m getting for my use case, agentic execution of a wide variety of tasks on my local device, are spectacular, even more so when you stack them up against every other model in the 20B weight class.
That's what I've been feeling too. But it is just a feeling. I'm not running any benchmarks.
My agentic coding "app" (basically just a tool "server" around dotnet/git/fs commands, plus a kanban board) seems to be able to spit out quick SPAs with little additional prompting.