Hacker News

It's a surprising result, and a lot of it stems from the Pro variant struggling with our custom harness on agentic tasks (whereas Flash does fine), as well as from provider instability. Failed requests are not counted against a model's score, but it's possible there are additional silent degradations even on requests that succeed.

Either that, or Flash is truly a better architecture and the Pro variant is heavily benchmaxxed. It wouldn't be the first time we've seen something like that in our benchmarking. We collect samples every week, so it'll be interesting to see whether the gap rebalances over time as new providers host the model. Flash is great though; it's so fast and cheap.