Yay. 5.4 was a frustrating model - moments of extreme intelligence (I liked it very much for code review) - but also a sort of idiocy/literalism that made it very unsuited for prompting in a vague sense. I also found its openclaw engagement wooden and frustrating. Which didn’t matter until anthropic started charging $150 a day for opus for openclaw.

Anyway - these benchmarks look really good; I’m hopeful on the qualitative stuff.