Quite honestly, this is the most interesting and useful thing I have ever read that directly addresses the question of "how good are LLMs at doing difficult tasks, in terms of both bang for the buck and raw performance?"

My hat's off to swelljoe.

This part was especially interesting:

> The cheap Chinese models kick ass. MiMo and DeepSeek are directly competitive with Opus 4.8 and GPT 5.5 at roughly an order of magnitude lower price. There have been accusations of “benchmaxxing” with the Chinese models, but I don’t think there’s any reasonable way for the models to already be tuned for these very recently disclosed bugs. I think they’re genuinely becoming competitive with the frontier from Anthropic and OpenAI. If you’re in a hurry, DeepSeek was the fastest, on average, while finding 4/9 bugs. And, if you’re cheap, MiMo found bugs as well as any model for the lowest price.