I was trying to get a better sense of the time/cost/quality matrix of these, so I threw together a quick eval of Sonnet 4.6, Mistral's dev model, and Opus 4.7 (figuring it's what you'd use if you were on Max).

For a Levenshtein distance implementation plus tests in JS, the results are pretty similar, but Mistral is 30x cheaper than Opus 4.7 and 4x faster than Sonnet 4.6.

https://5m6qnuhyde.evvl.io/

But that's not very informative.

Levenshtein distance is not only a well-understood problem, it's small, self-contained, and extremely well-represented in the training data. It's the kind of problem where even small or weak models can excel. The gold standard for those tasks is just "use a library", so no wonder the beefy models look expensive: you're chartering a commercial airplane to go grocery shopping.
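For a sense of scale, here is roughly what the whole task amounts to. This is a minimal sketch of the textbook two-row dynamic programming approach, not the output of any of the models in the eval above; the function name is my own choice, and it compares UTF-16 code units, so it does not handle surrogate pairs specially.

```javascript
// Levenshtein distance using two rolling rows of the DP table.
// Compares UTF-16 code units (plain string indexing).
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the first i-1 chars of a and the first j chars of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  let curr = new Array(b.length + 1);

  for (let i = 1; i <= a.length; i++) {
    curr[0] = i; // turning i chars of a into the empty string takes i deletions
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    [prev, curr] = [curr, prev]; // the row just computed becomes prev
  }
  return prev[b.length];
}
```

`levenshtein("kitten", "sitting")` returns 3. There are essentially no design decisions to get wrong here, which is the point.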

My personal benchmarks are software engineering tasks (ideally spanning multiple packages in a monorepo) composed of many small decisions that, compounded, make or break the implementation and long-term maintainability.

That's where even frontier models struggle, which makes comparisons meaningful.

>> many small decisions

It's making guesses, not decisions. Framing them as decisions will lead you astray and waste time and tokens.

It's vaguely productive to give them a ton of relevant info upfront, to minimise their need for load-bearing guesses. I say vaguely because they generally follow instructions just well enough to lull you into a false sense of security, not well enough to be relied on.

It's a bit more productive to use the various loop mechanisms (hooks, /goal, etc.) to evaluate each end of turn against guardrails and reject with clear instructions on what's unacceptable. Obviously, if you only do this without front-loading the info, you're likely to spend more tokens to reach a satisfactory end of iteration.
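To make the end-of-turn check concrete, here is a rough sketch of the idea. Everything in it is hypothetical: the rule names, the `{ path, content }` file shape, and `evaluateTurn` are made up for illustration and are not the hook API of any particular agent tool. How you collect the changed files and feed the feedback back into the loop depends on the tool you use.

```javascript
// Hypothetical end-of-turn guardrail check (not any real tool's API).
// Each rule inspects the files changed in the turn and returns one
// message per violation, phrased as an instruction the model can act on.
const rules = [
  {
    name: "no-skipped-tests",
    check: (files) =>
      files
        .filter(
          (f) =>
            /\.(test|spec)\.js$/.test(f.path) &&
            /\b(it|test|describe)\.skip\(/.test(f.content)
        )
        .map((f) => `${f.path}: do not skip tests; fix the test or the code under test.`),
  },
  {
    name: "no-lockfile-edits",
    check: (files) =>
      files
        .filter((f) => f.path.endsWith("package-lock.json"))
        .map((f) => `${f.path}: do not edit the lockfile by hand.`),
  },
];

// Runs every rule and either accepts the turn or rejects it with feedback.
function evaluateTurn(files, activeRules = rules) {
  const feedback = activeRules.flatMap((rule) =>
    rule.check(files).map((msg) => `[${rule.name}] ${msg}`)
  );
  return { accept: feedback.length === 0, feedback };
}
```

A turn that touches `src/a.test.js` with an `it.skip(` in it comes back with `accept: false` and a single `[no-skipped-tests]` message to hand to the model for the next iteration.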

If I perfectly know all the guardrails I need, I don't need an LLM, only Prolog.

While you are correct that something like Antigravity 2 + Opus 4.6 can handle large-scale software engineering tasks, I would argue that it is usually (but not always) better "coding agent hygiene" to work on smaller code modules and, as the human in the loop, to be a partner, not someone who prompts and then disengages.

Breaking code up into composable chunks has worked well for me over 50+ years as a professional software developer, and I can't get away from the idea that it is still usually the way to go using agentic coding tools.

The one detail I did forget to mention is that if anyone goes with the Mistral subscription (instead of paying per token), the Mistral Vibe tool gives you their Medium 3.5 model by default, with a 200k-token context. That will probably be enough for plenty of tasks, though there's a noticeable difference between 200k and a context of up to 1M.