Something is odd with this model, their blog posts shows REALLY good results, but in most other third-party benchmarks, people realize it's not really SOTA, even bellow Kimi K2.6 and GLM-5/5.1
In my tests too[0], it doesn't reach top 10. One issue, which they also mentioned in their post, is that they can't really serve well the model at the moment, so V4-Pro is heavily rate-limited and gives a lot of timeout errors when I try to test it. This shouldn't be an issue though, considering the model is open-source, but it makes it hard to accurately test at the moment.
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
> V4-Pro is heavily rate-limited and gives a lot of timeout errors when I try to test it. This shouldn't be an issue though, considering the model is open-source
Why does it matter if the model/architecture/weights are open source or not, given it's their proprietary inference hardware they're currently having issues with? Proprietary or not, the same issue would still be there on their platform.
I used pro via API (DeepSeek API not OpenRouter) with Claude Code, and the planning, visual solution, understanding was fantastic.
I would say I wouldn't notice this wasn't Opus 4.6. What I asked was looking at a feature implemented recently, and how it could be improved. Consumed 3.3 million tokens and create a much better flow.
It had a bug when I started the implementation though related to the API, which I suppose it is something they didn't catch when making their API compatible with CC.
Your “benchmark” is invalid. Penalizing the model because the hosting environment is being DDoSed by users a few hours after launch is utter nonsense.
I see that you tried to justify this lower in the thread, but no… it completely invalidates your benchmark. You are not testing the model. You are conflating one specific model host and model performance, and then claiming you are benchmarking the model. All major models are hosted by multiple different services.
In the real world, clients will just retry if there is a server error, and that will not impact response quality at all, and the workflow the model is being used in will not fail. If a workflow is so poorly coded that it doesn’t even have retry logic, then that workflow is doomed no matter which host you use. But again, reliability of the host is separate from the model.
You can make your benchmark valid by having separate leaderboards for model quality and host reliability. I’m not saying to throw the whole thing away. But the current claim is not valid.
And you’re also making an unsourced claim that everyone else has already determined this model sucks? Nah. The first result from Artificial Analysis shows good things: https://x.com/ArtificialAnlys/status/2047547434809880611
But I am still waiting to see the results from the full suite of AA benchmarks.
Hmm, the Flash performs significantly better than Pro in the benchmark? That's very strange; could rate limiting cause that?
Yes, Flash doesn't seem to have the same rate limits as Pro.
I expect once the API issues are fixed, for v4-pro to be around the same level as GLM-5.
Why would your test be including scores of failed responses/runs? That seems confusing.
(I am confused by the results your website is presenting)
Because the idea of those benchmarks is to see how well a model performs in real-world scenarios, as most models are served via APIs, not self-hosted.
So, for example, hypothetically if GPT-5.5 was super intelligent, but using it via API would fail 50% of the times, then using it in a real-life scenarios would make your workflows fail a lot more often than using a "dumber", but more stable model.
My plan is to also re-test models over-time, so this should account for infrastructure improvements and also to test for model "nerfing".
I take some issue with that testing methodology. It seems to me that you're conflating the model's performance with the reliability of whatever provider you're using to run the benchmark.
Many models, especially open weight ones, are served by a variety of providers in their lifetime. Each provider has their own reliability statistics which can vary throughout a model's lifetime, as well as day to day and hour to hour.
Not to mention that there are plenty of gateways that track provider uptime and can intelligently route to the one most likely to complete your request.
@seanw265 Yes, that's a problem. This can be solved for open-source models by running them on my own, but again the TPS will be dependent on the hardware used.
All models are tested through OpenRouter. The providers on OpenRouter vary drastically in quality, to the point where some simply serve broken models.
That being said, I usually test models a few hours after release, at which point, the only provider is the "official" one (e.g. Deepseek for their models, Alibaba for their own, etc.).
I don't really have any good solution for testing model reliability for closed-source models, BUT the outcome still holds: a model/provider that is more reliable, is statistically more likely to also give better results during at any given time.
A solution would be to regularly test models (e.g. every week), but I don't have the budget for that, as this is a hobby project for now.
@danyw, we reached max comment thread depth
Yes, I would. Currently I don't have that many tests (~20), and by default a test "run" includes 3 executions of each test. So, "bad luck" is already sort of solved in each run, by running each test 3 times.
Wouldn't you need to re-run across lots of samples (even for a single eval/bench) to avoid outsized impacts from just bad luck?