It's so weird to me that the benchmarks remain so low, yet the models are marketed as revolutionary. And if you say that low coding capabilities aren't a problem, tell that to the token price hike and the 'general use' model setup.

Why not sell it as a math agent? Why do I have to set up 4 agents to check each other's work?

From what I understand, it's because, unlike the other models, MAI models haven't yet been fine-tuned on the synthetic datasets specifically designed to boost benchmark scores.

It’s about bang for buck. That high a score for 5B params is pretty good, nigh unbelievable a short while ago.

It is my belief that smaller models will get better and better, and even cloud SOTA models will shrink.

Yet another reason the current buildout will feel like the railroads.

It's 5B active params in MoE, not 5B total params (total is 137B).
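For anyone who wants the arithmetic behind that distinction, here is a rough sketch. The 5B and 137B figures are from the comment above; the 4-bit (0.5 bytes per parameter) quantization and the helper names are my own assumptions for illustration, and the estimate covers weights only (no KV cache or activations). Whether you need the total or only the active footprint in fast memory depends on the runtime and offloading setup.

```python
# Rough weight-size arithmetic for an MoE model.
# 5B active / 137B total come from the thread; 0.5 bytes per parameter
# (4-bit quantization) is an assumption chosen for illustration.

def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the parameters used for each token."""
    return active_b / total_b

def weight_gib(params_b: float, bytes_per_param: float) -> float:
    """Approximate size of the weights in GiB (weights only)."""
    return params_b * 1e9 * bytes_per_param / 2**30

ACTIVE_B = 5.0    # billions of active parameters
TOTAL_B = 137.0   # billions of total parameters
BYTES_4BIT = 0.5  # assumed quantization

frac = active_fraction(ACTIVE_B, TOTAL_B)
total_gib = weight_gib(TOTAL_B, BYTES_4BIT)
active_gib = weight_gib(ACTIVE_B, BYTES_4BIT)

print(f"active share of params: {frac:.1%}")
print(f"all weights at 4-bit:    {total_gib:.1f} GiB")
print(f"active weights at 4-bit: {active_gib:.1f} GiB")
```

So per-token compute looks like a 5B model, but the full set of weights is still that of a 137B model.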

> It’s about bang for buck.

Hard to know when they don't give the price per token. Presumably it will be priced comparably to a low-to-mid range model. But otherwise their 'Ideal Zone' is meaningless without factoring in the price per token. I don't care how many tokens are being used; that's an implementation detail to me. I care about price / performance / latency.

https://docs.github.com/en/copilot/reference/copilot-billing...

Model             Input   Cached input   Output
MAI-Code-1-Flash  $0.75   $0.075         $4.50
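To make the price row above concrete, here is a minimal sketch of a per-request cost calculation. Note the assumptions: the row doesn't state a unit, so I'm assuming the prices are USD per 1M tokens, and the token counts in the example request are made up.

```python
# Per-request cost from the three listed prices.
# ASSUMPTION: prices are USD per 1M tokens (the unit is not stated above).
# The request sizes below are hypothetical.

PRICE_INPUT = 0.75    # uncached input
PRICE_CACHED = 0.075  # cached input
PRICE_OUTPUT = 4.50   # output
PER_TOKENS = 1_000_000

def request_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Cost in USD; cached_tokens is the part of input_tokens served from cache."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    uncached = input_tokens - cached_tokens
    total = (uncached * PRICE_INPUT
             + cached_tokens * PRICE_CACHED
             + output_tokens * PRICE_OUTPUT)
    return total / PER_TOKENS

# Hypothetical agentic coding turn: 200k input (150k of it cached), 20k output.
cost = request_cost(input_tokens=200_000, cached_tokens=150_000, output_tokens=20_000)
print(f"${cost:.4f} per request")
```

With numbers like these the output tokens dominate the bill, which is why token usage does end up mattering for price / performance even if you treat it as an implementation detail.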

Yeah the future is probably a number of highly specialised small models you can run on your own hardware rather than massive frontier models in the cloud.

That's what I'm betting on anyway.

That seems to be what Microsoft is betting on too, based on what was shown at the BUILD keynote today plus the new Surface Ultra and the Surface mini PC with the new Nvidia chip. Nadella really played up local AI as the main use case they have in mind.

MoE models basically work that way already: Qwen etc. with low active params (the A-number in the name) let you run inference on big models locally (only the active params have to fit into memory).

Step 3.7 Flash on my Asus GB10-based mini PC is incredibly close to that today. I’m very impressed, and that’s without MTP to boost performance.

The SOTA models will not shrink, because the problems will get bigger, from "write me a C compiler" to "clone Stripe's business and run it".

There will always be tasks that are within reach of whatever the SOTA models are, but not of the cheaper, perhaps locally runnable ones. It seems that people are already finding Qwen 3.6 27B sufficient for many coding tasks (the llama.cpp author is now using it exclusively).

As models get better and smaller, I expect that we will rapidly (within a year?) get to the point where SOTA models are not needed for the vast majority of coding tasks, and even today it seems many people are just using them for the planning phase.

How many people drive Ferraris vs Fords? How many people driving a Ford would, on a utilitarian basis, be any better off driving a Ferrari?

So far there seem to be mainly two high-volume use cases that have been found for LLMs, coding and business flow automation, and it seems neither of these needs SOTA models. I wonder if there will continue to be enough market demand for massive, expensive SOTA models to make them worth developing?