Curious if anyone else had the same reaction as me
This model is specifically trained on this task and significantly[1] underperforms opus.
Opus costs about 6x more.
Which seems... totally worth it based on the task at hand.
[1]: based on the total spread of tested models
Agreed. The idea is nice and honorable. At the same time, if AI has been proving one thing, it's that quality usually reigns over control and trust (except for some sensitive sectors and applications). Of course it's less capital-intense, so makes sense for a comparably little EU startup to focus on that niche. Likely won't spin the top line needle much, though, for the reasons stated.
> quality usually reigns over control and trust
Most Copilot customers use Copilot because Microsoft has been able to pinky promise some level of control for their sensitive data. That's why many don't get to use Claude or Codex or Mistral directly at work and instead are forced through their lobotomised Copilot flavours.
Remember, as of yet, companies haven't been able to actually measure the value of LLMs ... so it's all in the hands of Legal to choose which models you can use based on marketing and big words.
Ha, keep putting your prompts and workflows into cloud models. They are not okay with being a platform, they intend to cannibalize all businesses. Quality doesn't always reign over control and trust. Your data and original ideas are your edge and moat.
Treating "quality" as something you can reliably measure in AI proof tools sounds nice until you try auditing model drift after the 14th update and realize the "trust" angle stops being a niche preference and starts looking like the whole product. Brand is not a proof. Plenty of orgs will trade peak output for auditability, even if the market is bigger for YOLO feature churn.
EU could help them very much if they would start enforcing the Laws, so that no US Company can process European data, due to the Americans not willing to budge on Cloud Act.
That would also help to reduce our dependency on American Hyperscalers, which is much needed given how untrustworthy the US is right now. (And also hostile towards Europe as their new security strategy lays out)
This would be unfortunately a rather nuclear option due to the continent’s insane reliance on technology that breaks its unenforced laws.
How about not making these unenforced laws in the first place so that European companies could actually have a chance at competing? We're going to suffer the externalities of AI either way, but at least there would be a chance that a European company could be relevant.
The AI Act absolutely befuddled me. How could you release relatively strict regulation for a technology that isn't really being used yet and is in the early stages of development? How did they not foresee this kneecapping AI investment and development in Europe? If I were a tinfoil hat wearer I'd probably say that this was intentional sabotage, because this was such an obvious consequence.
Mistral is great, but they haven't kept up with Qwen (at least with Mistral Small 4). Leanstral seems interesting, so we'll have to see how it does.
Because the AI act was mostly written to address issues with ML products and services. It was mostly done before ChatGPT happened, so all the foundation model stuff got shoehorned in.
Speaking as someone who's been doing stats and ML for a while now, the AI act is pretty good. The compliance burden falls mostly on the companies big enough to handle it.
The foundation model parts are stupid though.
>Because the AI act was mostly written to address issues with ML products and services. It was mostly done before ChatGPT happened, so all the foundation model stuff got shoehorned in.
It's not an excuse. Anybody with half a working brain should've been able to tell that this was going to happen. You can't regulate a field in its infancy and expect it to ever function.
>The compliance burden falls mostly on the companies big enough to handle it.
You mean it falls on anyone that tries to compete with a model. There's a random 10^25 FLOPS compute rule in there. The B300 does 2500-3750 TFLOPS at fp16. 200 of these can hit that compute number in 6 months, which means that in a few years time pretty much every model is going to hit that.
And if somebody figures out fp8 training then it would only take 10 of these GPUs to hit it in 6 months.
The copyright rule and having to disclose what was trained on also means that it will be impossible to have enough training data for an EU model. And this even applies to people that make the model free and open weights.
I don't see how it is possible for any European AI model to compete. Even if these restrictions were lifted it would still push away investors because of the increased risk of stupid regulation.
Alignment tax directly eats to model quality, double digit percents.
[dead]
[dead]
[dead]
I'm never sure how much faith one can put into such benchmarks but in any case the optics seem to shift once you have pass@2 and pass@3.
Still, the more interesting comparison would be against something such as Codex.
But you can run this model for free on a common battery powered laptop sitting on your laps without cooking your legs.
Sorry, but what are you talking about? This is a 120B-A6B model, which isn't runnable on any laptop except the most beefed up Macbooks, and then will certainly drain its battery and cook your legs.
Yeah my bad, it requires an expensive MacBook.
I think it would still be fine for the legs and on battery for relatively short loads: https://www.notebookcheck.net/Apple-MacBook-Pro-M5-2025-revi...
But 40 degrees and 30W of heat is a bit more than comfortable if you run the agent continuously.
You can easily run a quant of this on a DGX Spark though. Seems like a small investment if it meaningful improves Lean productivity.
Is it though?
Most people I know that use agents for building software and tried to switch to local development, every single time they switch back to Claude/codex.
It's just not worth it. The models are that much better and continue to get released / improve.
And it's much cheaper unless you're doing like 24/7 stuff.
Even on the $200/m plan, that's cheaper than buying a $3k dgx or $5k m4 max with enough ram.
Not to mention you can no longer use your laptop as a laptop as the power draw drains it - you'd need to host separately and connect
A single DGX Spark can service a whole department of mathematicians (or programmers), and you can cluster up to 4 of them them to fit very large models like GLM-5 and quants of Kimi K2.5. This is nearing frontier-level model size.
I understand the value proposition of the frontier cloud models, but we're not as far off from self-hosting as you think, and it's becoming more viable for domain-specific models.
That's great news- I wonder if that will help drive cloud costs down too
the model is open source, you can run it locally. You don't think thats significant ?