I don’t believe similar scores on small bounded tasks mean models are interchangeable. I’ve found that heavy token-burning workflows are good for my productivity (letting multiple sessions run async, working on different stuff). Claude ultracode is an easy example to point to, but there are tons of harnesses out there doing similar things. I find using a higher-quality model matters because it affects how far it can get unattended before heading in the wrong direction. I’ve tried using the cheaper/faster models and it’s a real downgrade (or completely useless). A model that’s even smarter with a longer time horizon would be even better for my productivity. I don’t think we are at the ceiling for model quality or price. My employer pays a lot for my tokens but it’s still a lot less than they pay me.

I agree Anthropic faces some risk of getting commoditized, but on the other hand, if things go well they could end up leading adoption into more industries. There are upside and downside scenarios. Recursive self-improvement is obviously an important unknown and could lead to a winner-take-all outcome.