> Nothing has reached Opus and GPT5 levels in my personal experience
You mean GPT 5.5 xhigh and Claude Opus 4.8 max? At least the benchmarks / public evals / rankings show that some of the new coding models (e.g., Qwen 3.7 Max and MiMo v2.5 Pro) are at Opus 4.7 and GPT 5.4 level, but 3x to 5x cheaper: https://artificialanalysis.ai/leaderboards/models / https://gertlabs.com/rankings Personally, over the past month or so, I haven't missed GPT 5.4 / Opus 4.7 after moving to Qwen 3.7 / MiMo 2.5 / Kimi 2.6 et al.
That is very promising news. I will re-eval them all shortly. And you are suggesting that a higher reasoning budget can make up for weaker per-token performance? That is indeed worth evaluating.
Comparing models at their vendor-specific effort settings is apples to oranges. Ideally the evals would use a thinking token cap or something similar, so we can compare per-token performance. But eval is hard enough as it is.
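To make the token-cap idea concrete, here is a rough sketch of how it could be approximated after the fact from existing eval logs. Everything in it is an assumption for illustration: the `Run` record, the `accuracy_under_cap` helper, the model names, and the numbers are all made up, not taken from any real leaderboard. It also only simulates a cap by counting over-budget runs as failures; a real capped eval would cut off thinking at generation time, which could give different results.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One eval attempt (hypothetical record layout)."""
    model: str
    passed: bool
    thinking_tokens: int


def accuracy_under_cap(runs, cap):
    """Per-model pass rate, counting any run that exceeded the cap as a failure."""
    totals = {}
    passes = {}
    for r in runs:
        totals[r.model] = totals.get(r.model, 0) + 1
        ok = r.passed and r.thinking_tokens <= cap
        passes[r.model] = passes.get(r.model, 0) + (1 if ok else 0)
    return {m: passes[m] / totals[m] for m in totals}


# Made-up data: model_b only matches model_a by spending far more thinking tokens.
runs = [
    Run("model_a", True, 2000),
    Run("model_a", True, 9000),
    Run("model_a", False, 3000),
    Run("model_a", True, 1500),
    Run("model_b", True, 12000),
    Run("model_b", True, 15000),
    Run("model_b", True, 4000),
    Run("model_b", False, 20000),
]

uncapped = accuracy_under_cap(runs, cap=10**9)
capped = accuracy_under_cap(runs, cap=8000)
print(uncapped)
print(capped)
```

With these invented numbers the two models tie when uncapped, and the gap only shows up once both are held to the same thinking budget, which is the kind of difference a vendor-specific effort setting would hide.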