So you're going to use DeepSeek, Qwen, GLM, Kimi and Mistral now? I tried them, and they really fall short of GPT and Claude.
Without access to US models, I'd be limited to asking simple questions in chat interfaces and maybe some grunt work in coding CLIs, and the weaker models would mess up even that.
Nothing has reached Opus and GPT5 levels in my personal experience, which also aligns with what the labs themselves admit ("near-frontier").
Well, I'm definitely not using models that I can't access.
So now the question is whether the capabilities of other models are worth their far cheaper token prices.
Plus, are we at all confident Opus or GPT 5.5 aren't about to get shut off?
Not everyone needs SOTA. Also, many people weigh speed, token / plan cost, and other factors when choosing a model.
> Nothing has reached Opus and GPT5 levels in my personal experience
You mean GPT 5.5 xhigh and Claude Opus 4.8 max? At least the benchmarks / public evals / rankings show some of the new coding models (e.g. Qwen 3.7 Max & MiMo v2.5 Pro) are at Opus 4.7 & GPT 5.4 level (but 3x to 5x cheaper): https://artificialanalysis.ai/leaderboards/models and https://gertlabs.com/rankings Personally speaking, over the past month or so, I haven't missed GPT 5.4 / Opus 4.7 after moving to Qwen 3.7 / MiMo 2.5 / Kimi 2.6 et al.
That is very promising news. I will re-eval them all shortly. And you are suggesting that a higher reasoning budget can make up for weaker per-token performance? That is indeed worth evaluating.
Comparing models at their vendor-specific effort settings is apples to oranges. Ideally the evals would use a thinking token cap or something, so we can compare per-token performance. But eval is hard enough as it is.
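For what it's worth, the thinking-token-cap idea can be roughly approximated after the fact if an eval logs how many reasoning tokens each run used. Here is a minimal sketch in Python, assuming a hypothetical `RunResult` record and made-up placeholder data (no real model or benchmark numbers). It counts a run as solved only if it also stayed under the cap, which is stricter than a true cap, since a model cut off mid-reasoning might still answer correctly.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical per-task log entry; the field names are assumptions."""
    model: str
    solved: bool
    thinking_tokens: int


def accuracy_under_cap(results, cap):
    """Per-model accuracy where runs exceeding `cap` thinking tokens count as failures."""
    tallies = {}
    for r in results:
        solved, total = tallies.get(r.model, (0, 0))
        within_budget = r.solved and r.thinking_tokens <= cap
        tallies[r.model] = (solved + int(within_budget), total + 1)
    return {model: solved / total for model, (solved, total) in tallies.items()}


# Placeholder data for illustration only, not real measurements.
results = [
    RunResult("model_a", True, 500),
    RunResult("model_a", True, 5000),
    RunResult("model_b", True, 800),
    RunResult("model_b", False, 300),
]

for cap in (1_000, 10_000):
    print(cap, accuracy_under_cap(results, cap))
```

Sweeping `cap` over a few values shows how much of a model's score depends on a large reasoning budget rather than on per-token performance.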
I have been using DeepSeek at home. I have access to Claude and ChatGPT at work.
I honestly think that DeepSeek is as good as, and sometimes even better than, the competition.