My personal observation (using a mix of opencode and pi harness):
1. DS4Pro: around opus 4.5
2. DS4Flash: around sonnet 4
3. Mimo v2.5 pro: between opus 4.5 and opus 4.6.
4. minimax M3: around opus 4.6
All of these are very close in terms of quality and pricing. For anything that is not specifically related to coding, DS4Flash has become my de facto model. It just works... super fast, tool calling is perfect, and the price is unbeatable. Caching is out of this world. I'm now regularly hitting a 90%+ cache hit rate.
i have been using deepseek-v4-flash since it came out. i use a highly structured harness and spec/test driven workflow running through opencode, and so far there has been nothing it can't do.
i have run through a bunch of tests: re-writing vvenc with assembly kernels, creating the first generation agent harness integration with opencode, porting TS npm modules to C++, porting an entire TS server app to C++, creating a new pure io_uring http server with zero-copy (325K RPS single core), creating a second generation agent from the ground up in C++, setting up a dev environment for custom kernel development on tenstorrent accelerators using tt-metal and ttsim.
i consistently get 98.5% input cache hit ratio. i do see noticeable degradation in performance in the 400-500K context range, so i always try to wrap up sessions by 500K max.
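A minimal sketch of how a cache hit ratio and a context-size cutoff like the ones described above could be tracked per session. It assumes the provider reports, for each request, how many input tokens were served from cache and how many were not. `SessionStats`, its field names, and the 400K default are hypothetical and chosen for illustration (the default is the lower end of the range mentioned above); none of this is taken from opencode or any provider SDK.

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Running totals for one agent session (hypothetical helper)."""
    cached_input_tokens: int = 0    # input tokens served from the prompt cache
    uncached_input_tokens: int = 0  # input tokens that missed the cache
    context_tokens: int = 0         # input size of the most recent request

    def record(self, cached: int, uncached: int) -> None:
        """Add one request's input token counts to the session totals."""
        self.cached_input_tokens += cached
        self.uncached_input_tokens += uncached
        # The latest request's input size approximates the current context size.
        self.context_tokens = cached + uncached

    def cache_hit_ratio(self) -> float:
        """Fraction of all input tokens so far that were cache hits."""
        total = self.cached_input_tokens + self.uncached_input_tokens
        return self.cached_input_tokens / total if total else 0.0

    def should_wrap_up(self, soft_limit: int = 400_000) -> bool:
        """True once the context reaches the assumed soft limit."""
        return self.context_tokens >= soft_limit


if __name__ == "__main__":
    stats = SessionStats()
    stats.record(cached=0, uncached=10_000)    # first request: nothing cached yet
    stats.record(cached=10_000, uncached=500)  # second request reuses the prefix
    print(f"hit ratio: {stats.cache_hit_ratio():.1%}")
    print(f"wrap up now: {stats.should_wrap_up()}")
```

The ratio here is cumulative over the session, so a long session with a stable prompt prefix trends toward a high value even though the very first request is always a full miss.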
a non-intuitive thing is that the model is very good at low-level systems engineering. i suspect this is because they are internally using it to port their stack to huawei hardware. it can churn out exceptionally complex low level C++ stuff that blows your mind, and then completely choke and run in circles on other seemingly simple tasks.
i only use flash and not pro because i want my tooling to be portable to open weights models that are practical to run. i use deepseek platform and not the open weights models for development, because it is subsidized, and based on observation, i think it is highly likely that they are running some proprietary features on the platform which are not in the open weights model.
it will be very interesting to see what their next point release looks like. the compounding effect of optimizing inference cost and then feeding back inference into training should lead to rapid and accelerating improvement, but only time will tell.
Thanks for the details. What's a second generation agent?
You mentioned the workflow is heavy on specs and tests. The smaller models seem to be really good at following instructions now. (Well, some of them!)
So that's probably part of why you're seeing good results. It has a very clear target.
Whereas with more open ended instructions they seem to struggle more. I think common sense is the main thing you get with model size.
When I'm working with the big models I feel like I don't have to spell things out so much. The gap is closing, but I'm assuming there is some fundamental limit there based on the size.
Of course the ideal would be Mythos, running for free, in my house, at 1,000 tok/s ;) Someday...
Thanks a lot for such an insightful comment. The low-level part, including porting entire codebases using DS4Flash, came as a genuine surprise to me. I did not expect it to be this good.
When you say "i use a highly structured harness" ... can you please tell me what it is exactly?
https://github.com/opensassi/opencode
Thanks..
I always feel GPT5.5 is better at ‘getting the bigger picture’ when I am describing something vaguely vs Chinese models. What’s your experience with that?
That's true. The open models still do not match these extreme high-end models at very high-level understanding.
But that's also not needed most of the time. There will always be a "better" model... but that doesn't make other models "bad".
For my use-cases, open models are now almost on par with these top models... and it's only extremely rarely that I genuinely "need" the help of top-of-the-line closed models.